[
{
"title": "First Post, An Explication",
"excerpt": "A solemn oath to the reader… An oath I will surely forget and fail to uphold. Take it as you will\n",
"content": "This is my first blog, so I guess I should start off by explaining my motives behind writing what will probably be a sporadically updated blog. Basically, as I learn things that I find pretty difficult to find online, I’ll try to explain them as best I can here. Also, since I enjoy learning math, I’m going to try to keep up a (semi) regular stream of math posts. If you have any questions, feel free to contact me. My up-to-date contact info can be found on my website aaron.niskin.org ",
"url": "/blog/2015/11/29/first-post-an-explication.html"
},
{
"title": "What's in a language?",
"excerpt": "The basics of the theory of formal languages\n",
"content": "Languages in abstraction This post is about Languages from a mathematical and abstract linguistics point of view. Not much more to say about that, so let’s get right to it! The Rigorous definition: Let \\( \\Sigma \\) be an alphabet and let \\( \\Sigma^k \\) be the set of all strings of length k over that alphabet. Then, we define \\( \\Sigma^* \\) to be \\( \\bigcup\\limits_{k\\in\\mathbb{N}}\\Sigma^k \\) (the union of ∑k over all natural numbers k). If \\( L\\subseteq\\Sigma^* \\) , we call \\( L \\) a language. The Intuition Behind the Definition Consider an alphabet (some finite set of characters), for example we can consider the letters of the English language, the ASCII symbols, the symbols \\( {0, 1} \\) (otherwise known as binary), or the symbols \\( {1, 2, 3, 4, 5, 6, 7, 8, 9, 0, +, \\times , =} \\) . We can then construct the infinite list of all the different ways we can arrange those characters (e.g. \\( 1001011 \\) or \\( 0011011011 \\) , etc. if we’re using binary). We call these arrangements “strings”. Once we have all that machinery built up, a language is just some subset of that infinite collection of strings. The language may itself be infinite. Some Examples The alphabet: \\( \\Sigma={0, 1} \\) The language: \\( {x\\in\\Sigma|x \\text{ is prime}} \\) (all prime numbers in binary) Some strings from the language: \\( 10, 11, 101… \\) The alphabet: ASCII characters The language: All syntactically correct Clojure programs (the source code) The alphabet: All Clojure functions, operators, etc, and list \\( {x, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0} \\) The language: All syntactically correct Clojure programs (the source code) You see that we need to have an alphabet before we can have a Formal Language. Also, different alphabets may result in equivalent languages – by equivalent, we mean that both languages contain the same strings. What are we getting at? Well, there are two ways to look at this. 
On the first hand, linguists would like to study language in its abstract essence. For this, Formal Languages may come in handy (if endowed with a Grammar and possibly more). That is not the reason I will be studying Formal Languages. I’m learning about formal languages to find their applications to computing. Apparently, with the help of a little Mathematical thinking, we can assign semantics to the strings in a language and somehow correlate them with real world problems – such as computability, the P=nP problem, cryptography, and more! ",
"url": "/blog/2015/11/29/whats-in-a-language.html"
},
{
"title": "A dirty little ditty on Finite Automata",
"excerpt": "A basic yet formal introduction to Finite Automata and how strings act on them.\n",
"content": "This post builds on the previous post about Formal Languages. Some Formal Definitions A Deterministic Finite Automata (DFA) is A set \\( \\mathcal{Q} \\) called “states” A set \\( \\Sigma \\) called “symbols” or “alphabet” A function \\( \\delta_F:\\mathcal{Q}\\times\\Sigma \\to \\mathcal{Q} \\) A designated state \\( q_0\\in\\mathcal{Q} \\) called the start point A subset \\( F\\subseteq\\mathcal{Q} \\) called the “accepting states” The DFA is then often referred to as the ordered quintuple \\( A=(\\mathcal{Q},\\Sigma,\\delta_F,q_0,F) \\). Defining how strings act on DFAs. Given a DFA, \\( A=(\\mathcal{Q}, \\Sigma, \\delta, q_0, F) \\), a state \\( q_i\\in\\mathcal{Q} \\), and a string \\( w\\in\\Sigma^* \\), we can define \\( \\delta(q_i,w) \\) like so: If \\( w \\) only has one symbol, we can consider \\( w \\) to be the symbol and define \\( \\delta(q_i,w) \\) to be the same as if we considered \\( w \\) as the symbol If \\( w=xv \\), where \\( x\\in\\Sigma \\) and \\( v\\in\\Sigma^* \\), then \\( \\delta(q_i, w)=\\delta(\\delta(q_i,x),v) \\) And in this way, we have defined how DFAs can interpret strings of symbols rather than just single symbols. The language of a DFA Given a DFA, \\( A=(\\mathcal{Q}, \\Sigma, \\delta, q_0, F) \\), we can define “the language of \\( A \\)”, denoted \\( L(A) \\), as \\( {w\\in\\Sigma^*|\\delta(q_0,w)\\in F} \\). Some Examples, Maybe? Example 1: Let’s construct a DFA that accepts only strings beginning with a 1 that, when interpreted as binary numbers, are multiples of 5. 
So some examples of strings that would be in \\( L(A) \\) are 101, 1010, 1111 Some More Formal Definitions A Nondeterministic Finite Automata (NFA) is A set \\( \\mathcal{Q} \\) called “states” A set \\( \\Sigma \\) called “symbols” A function \\( \\delta_N:\\mathcal{Q}\\times\\Sigma \\to \\mathcal{P}\\left(\\mathcal{Q}\\right) \\) A designated state \\( q_0\\in\\mathcal{Q} \\) called the start point A subset \\( F\\subseteq\\mathcal{Q} \\) called the “accepting states” The NFA is then often referred to as the ordered quintuple \\( A=(\\mathcal{Q},\\Sigma,\\delta_N,q_0,F) \\). Defining how strings act on NFAs. Given an NFA, \\( N=(\\mathcal{Q}, \\Sigma, \\delta, q_0, F) \\), a collection of states \\( \\mathcal{S}\\subseteq\\mathcal{Q} \\), and a string \\( w\\in\\Sigma^* \\), we can define \\( \\delta(\\mathcal{S},w) \\) like so: If \\( w \\) only has one symbol, then we can consider \\( w \\) to be the symbol and define \\( \\delta(\\mathcal{S},w):=\\bigcup\\limits_{s\\in\\mathcal{S}}\\delta(s,w) \\) If \\( w=xv \\), where \\( x\\in\\Sigma \\) and \\( v\\in\\Sigma^* \\), then \\( \\delta(\\mathcal{S}, w)=\\bigcup\\limits_{s\\in\\mathcal{S}}\\delta(\\delta(s,x),v) \\) And in this way, we have defined how NFAs can interpret strings of symbols rather than just single symbols. Maybe in another post, we’ll get past the definitions! :-/ ",
"url": "/blog/2015/12/03/a-dirty-little-ditty-on-finite-automata.html"
},
{
"title": "Probability -- A Measure Theoretic Approach",
"excerpt": "We cover probability from a measure theoretic approach\n",
"content": "Probability using Measure Theory A mathematically rigorous definition of probability, and some examples therein. The Traditional Definition: Consider a set \\( \\Omega \\) (called the sample space ), and a function \\( X:\\Omega\\rightarrow\\mathbb{R} \\) (called a random variable . If \\( \\Omega \\) is countable (or finite), a function \\( \\mathbb{P}:\\Omega\\rightarrow\\mathbb{R} \\) is called a probability distribution if it satisfies the following 2 conditions: For each \\( x \\in \\Omega \\), \\( \\mathbb{P}(x) \\geq 0 \\) If \\( A_i\\cap A_j = \\emptyset \\), then \\( \\mathbb{P}(\\bigcup\\limits_0^\\infty A_i) = \\sum\\limits_0^\\infty\\mathbb{P}(A_i) \\) And if \\( \\Omega \\) is uncountable, a function \\( F:\\mathbb{R}\\rightarrow\\mathbb{R} \\) is called a probability distribution or a cumulative distribution function if it satisfies the following 3 conditions: For each \\( a,b\\in\\mathbb{R} \\), \\( a < b \\rightarrow F(a)\\leq F(b) \\) \\( \\lim\\limits_{x\\to -\\infty}F(x) = 0 \\) \\( \\lim\\limits_{x\\to\\infty}F(x) = 1 \\) The Intuition: What idea are we even trying to capture with these seemingly disparate definitions for the same thing? Well, with the two cases taken separately it's somewhat obvious, but they don't seem to marry very well. The discrete case is giving us a pointwise estimation of something akin to the proportion of observations that should correspond to a value (in a perfect world). The continuous case is the same thing, but instead of corresponding to that particular value (which doesn't really even make sense in this case), the proportion corresponds to the point in question and everything less than it. The shaded region in the top picture below and the curve in the picture directly below it denote the cumulative density function of a standard normal distribution (don't worry too much about what that means for this post, but if you're doing anything with statistics, you should probably know a bit about that). 
Another way to define a continuous probability distribution is through something called a probability density function, which is closer to the discrete case definition of a probability distribution (or probability mass function ). A probability density function is a function \\( f:\\mathbb{R}\\rightarrow\\mathbb{R}+ \\) such that \\( \\int{-\\infty}^xf(t)dt = F(x) \\). In other words, \\( \\frac{dF}{dX} = f \\). This new function has some properties of our discrete case probability function, but lacks some others. On the one hand, they’re both defined pointwise, but on the other, this one can be greater than one in some places – meaning the value of the probability density function isn’t really the probability of an event, but rather (as the name “suggests”) the density therein. Does it measure up? Now let’s check out the measure theoretic approach… Let \\( \\Omega \\) be our sample space, \\( S \\) be the \\( \\sigma \\)-algebra on \\( \\Omega \\) (so \\( S \\) is the collection of measurable subsets of \\( \\Omega \\)), and \\( \\mu:S\\to\\mathbb{R} \\) a measure on that measure space. Let \\( X:\\Omega\\rightarrow\\mathbb{E} \\) be a random variable (\\( \\mathbb{E} \\) is generally taken to be \\( \\mathbb{R} \\) or \\( \\mathbb{R}^n \\)). We define the function \\( \\mathbb{P}:\\mathcal{P}(\\mathbb{E})\\rightarrow\\mathbb{R} \\) (where \\( \\mathcal{P}(\\mathbb{E}) \\) is the powerset of \\( \\mathbb{E} \\) – the set of all subsets) such that if \\( A\\subseteq\\mathbb{E} \\), we have \\( \\mathbb{P}(A)=\\mu(X^{-1}(A)) \\). We call \\( \\mathbb{P} \\) a probability distribution if the following conditions hold: \\( \\mu(\\Omega) = 1 \\) for each \\( A\\subseteq\\mathbb{E} \\) we have \\( X^{-1}(A)\\in S \\). Why do this? Well, right off the bat we have a serious benefit: we no longer have two disparate definitions of our probability distributions. 
Furthermore, there is the added benefit of having a natural separation of concerns: the measure \\( \\mu \\) determines the what we might intuitively consider to be the probability distribution while the random variable is used to encode the aspects of the events that we care about. To further illustrate this The Examples A fair die All even Let’s consider a fair die. Our sample space will be \\( {1,2,3,4,5,6} \\). Since our die is fair, we’ll define our measure fairly: for any \\( x \\) in our sample space, \\( \\mu({x}) = \\frac{1}{6} \\). If we want to know, for instance, what the probability of getting each number is, we could use a very intuitive random variable \\( X(a) = a \\) (so \\( X(1)=1 \\), etc.). Then we see that \\( \\mathbb{P}({1}) = \\mu({1}) = \\frac{1}{6} \\), and the rest are found similarly. Odds and Evens? What if we want to consider the fair die of yester-paragraph, but we only care if the face of the die shows an odd or an even number? Well, since the actual distribution of the die hasn’t changed, we won’t have to change our measure. Instead we’ll change our random variable to capture just those aspects we care about. In particular, \\( X(a) = 0 \\) if \\( a \\) is even, and \\( X(a) = 1 \\) if \\( a \\) is odd. We then see \\( \\mathbb{P}(1) = \\mu({1,3,5}) = \\frac{1}{2} \\) and \\( \\mathbb{P}(0) = \\mu({2,4,6}) = \\frac{1}{2} \\) Getting loaded All even Now let’s consider the same scenario of wanting to know the probability of getting each number, but now our die is loaded. Being as how we’re changing the distribution itself and not just the aspects we’re choosing to care about, we’re going to want to change the measure this time. For simplicity, let’s consider a kind of degenerate case scenario. Let our measure be: \\( \\mu(A) = 1 \\) if \\( 1\\in A \\) and \\( \\mu(A)=0 \\) if \\( 1\ otin A \\). Basically, we’re defining our probability to be such that the only possible outcome is a roll of 1. 
So since we are concerned with the same things we were concerned with last time, we can take that same random variable. We note \\( \\mathbb{P}(1) = 1 \\) and \\( \\mathbb{P}(a) = 0 \\) for any \\( a \ eq 1 \\). Odds or evens Try to do this one yourself. I’m going to go get some sleep now. Please feel free to contact me with any questions. I love doing this stuff, so don’t be shy! ",
"url": "/blog/2017/02/18/probability-a-measure-theoretic-approach.html"
},
{
"title": "Linear Regression -- The Basics",
"excerpt": "Linear Regression is the bedrock of Machine Learning. We cover the basics therein, some theory involved, and give some relevant examples.\n",
"content": "The basics Yeah. It’s not a good sign if I’m starting out already repeating myself. But that’s how things seem to be with linear regression, so I guess it’s fitting. It seems like every day one of my professors will talk about linear regression, and it’s not due to laziness or lack of coordination. Indeed, it’s an intentional part of the curriculum here at New College of Florida because of how ubiquitous linear regression is. Not only is it an extremely simple yet expressive formulation, it’s also the theoretical basis of a whole slew of other tactics. Let’s just get right into it, shall we? Linear Regression: Let’s say you have some data from the real world (and hence riddled with real-world error). A basic example for us to start with is this one: There’s clearly a linear trend there, but how do we pick which linear trend would be the best? Well, one thing we could do is pick the line that has the least amount of error from the prediction to the actual data-point. To do that, we have to say what we mean by “least amount of error”. For this post, we’ll calculate that error by squaring the difference between the predicted value and the actual value for every point in our data set, then averaging those values. This standard is called the Mean-Squared-Error (MSE). We can write the MSE as: where \\( \\hat Y_i\\) is our predicted value of \\(Y_i\\) for a give \\(X_i\\). Being as how we want a linear model (for simplicity and extensibility), we can write the above equation as, for some \\( \\alpha, \\beta \\) that we don’t yet know. But since we want to minimize that error, we can take some derivatives and solve for \\( \\alpha, \\beta \\)! Let’s go ahead and do that! We want to minimize We can start by finding the \\( \\hat\\alpha \\) such that, \\( \\frac{d}{d\\alpha}\\sum\\limits_{i=1}^N\\left(\\alpha + \\beta X_i - Y_i\\right)^2 = 0 \\). 
And as long as we don’t forget the chain rule, we’ll be alright… and we’ll find the \\( \\beta \\) such that \\( \\frac{d}{d\\beta}\\sum\\limits_{i=1}^N\\left(\\alpha + \\beta X_i - Y_i\\right)^2 = 0 \\) And following a similar pattern we find (sorry for the editing… Wordpress.com isn’t the greatest therein): But note: So, And: So, And then we can find \\( \\alpha \\) by substituting in our approximation of \\( \\beta \\). Using those coefficients, we can plot the line below, and as you can see, it really is a good approximation. Now we have it Okay, so now we have our line of “best fit”, but what does it mean? Well, it means that this line predicts the data we gave it with the least error. That’s really all it means. And sometimes, as we’ll see later, reading too much into that can really get you into trouble. But using this model we can now predict other data outside the model. So, for instance, in the model pictured above, if we were to try and predict \\( Y \\) when \\( X=2 \\), we wouldn’t do so bad by picking something around 10 for \\( Y \\). An example, perhaps? So I feel at this point, it’s probably best to give an example. Let’s say we’re trying to predict stock price given the total market price. Well, in practice this model is used to assess volatility, but that’s neither here nor there. Right now, we’re really only interested in the model itself. But without further ado, I present you with, the CAPM (Capital Asset Pricing Model): (where \\( \\epsilon \\) is the error in our predictions). And you can fit this using historical data or what-have-you. There are a bunch of downsides to fitting it with historical data though, like the fact that data from 3 days ago really doesn’t have much to say about the future anymore. There are plenty of cool things you can do therein, but sadly, those are out of the scope of this post. For now, we move on to Multiple Regression What is this anyway? 
Well, multiple regression is really just a new name for the same thing: how do we fit a linear model to our data given some set of predictors and a single response variable? The only difference is that this time our linear model doesn’t have to be one dimensional. Let’s get right into it, shall we? So let’s say you have \\( k \\) many predictors arranged in a vector (in other words, our predictor is a vector in \\( \\mathbb{R}^k \\)). Well, I wonder if a similar formula would work… Let’s figure it out… Firstly, we need to know what a derivative is in \\( \\mathbb{R}^n \\). Well, if \\( f:\\mathbb{R}^n\\to\\mathbb{R}^m \\) is a differentiable function, then for any \\( x \\) in the domain, \\( f’(x) \\) is the linear map \\( A:\\mathbb{R}^n\\to\\mathbb{R}^m \\) such that \\( \\lim\\limits_{h\\to 0}\\frac{||f(x+h) - f(x) - Ah||}{||h||} = 0 \\). Basically, \\( f’(x) \\) is the best linear approximation to \\( f \\) near \\( x \\) (it describes the tangent plane). So, now that we got that out of the way, let’s use it! We want to find the linear function that minimizes the Euclidean norm of the error terms (just like before). But note: the error term is \\( \\epsilon = Y - \\hat Y = Y - \\alpha -\\beta X \\), for some vector \\( \\alpha \\) and some matrix \\( \\beta \\). Now, since it’s easier and it’ll give us the same answer, we’re going to minimize the squared error term instead of just the error term (like we did in the one dimensional version). We’re also going to make one more simplification: That \\( \\alpha=0 \\). We can do this safely by simply appending (or prepending) a 1 to the rows of our data (thereby creating a constant term). So for the following, assume we’ve done that. So, let’s find the \\( \\beta \\) that minimizes that. So, now we see that the derivative is \\( -2Y^TX + 2\\beta^TX^TX \\) and we want to find where our error is minimized, so we want to set that derivative to zero: \\( -2Y^TX + 2\\beta^TX^TX = 0 \\), which (when \\( X^TX \\) is invertible) gives \\( \\beta = (X^TX)^{-1}X^TY \\). And there we have it. That’s called the normal equation for linear regression. 
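As a rough illustration of the normal equation (a minimal sketch, not the code used for this post): assuming each row of \\( X \\) is a data point with a leading 1 for the constant term, the coefficients solve \\( X^TX\\beta = X^TY \\). With a single predictor that is a 2-by-2 linear system, which can be solved by hand. The function name fit_line and the sample data below are made up for this example.

```python
def fit_line(xs, ys):
    # Entries of X^T X and X^T Y when each row of X is (1, x_i).
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solve [[n, sx], [sx, sxx]] (alpha, beta) = (sy, sxy) by Cramer's rule.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError('all x values are identical; the slope is undefined')
    alpha = (sy * sxx - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    return alpha, beta


# Hypothetical data lying exactly on y = 1 + 2x.
alpha, beta = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(alpha, beta)
```

For more than one predictor the same system would normally be handed to a linear algebra library rather than solved by hand.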
Maybe next time I’ll post about how we can find these coefficients given some data using gradient descent, or some modification thereof. Till next time, I hope you enjoyed this post. Please, let me know if something could be clearer or if you have any requests. ",
"url": "/blog/2017/02/20/What-Is-Linear-Regression.html"
},
{
"title": "Markov Chain Monte Carlo",
"excerpt": "Markov Chain Monte Carlo is a tool for sampling otherwise intractible distributions. In this post, we’ll go through some descriptions, reasoning, and examples therein.\n",
"content": " Background What if we know the relative likelihood, but want the probability distribution? But what if \\( \\int f(x)dx \\) is hard, or you can’t sample from \\( f \\) directly? This is the problem we will be trying to solve. First approach If space if bounded (integral is between \\( a,b \\)) we can use Monte Carlo to estimate \\( \\int\\limits_a^b f(x)dx \\) Pick \\( \\alpha \\in (a,b) \\) Compute \\( f(\\alpha)/(b-a) \\) Repeat as necessary Compute the expected value of the computed values What if it’s a big ol’ bag of nope? What if we can’t sample from \\( f(x) \\) but can only determine likelihood ratios? Enter Markov Chain Monte Carlo The obligatory basics A Markov Chain is a stochastic process (a collection of indexed random variables) such that \\( \\mathbb{P}(X_n=x|X_0,X_1,…,X_{n-1}) = \\mathbb{P}(X_n=x|X_{n-1}) \\). In other words, the conditional probabilities only depend on the last state, not on any deeper history. We call the set of all possible values of \\( X_i \\) the state space and denote it by, \\( \\chi \\). Transition Matrix Let \\( p_{ij} = \\mathbb{P}(X_1 = j | X_0 = i) \\). We call the matrix \\( (p_{ij}) \\) the Transition Matrix of \\( X \\) and denote it \\( P \\). Let \\( \\mu_n = \\left(\\mathbb{P}(X_n=0), \\mathbb{P}(X_n=1),…, \\mathbb{P}(X_n=l)\\right) \\) be the row vector corresponding to the “probabilities” of being at each state at the \\( n \\)th point in time (iteration). Claim: \\( \\mu_{i+1} = \\mu_0P^{i+1} \\) Proof: if \\( i = 0 \\): So: \\( \\mu_1 = \\mu_0P \\) if \\( i > 0 \\): Then, \\( \\mu_{i+1} = \\mu_iP = (\\mu_0P^i)P = \\mu_0P^{i+1} \\) Stationary Distributions We say that a distribution \\( \\pi \\) is stationary if Main Theorem: An irreducible, ergotic Markov Chain \\( {X_n} \\) has a unique stationary distribution, \\( \\pi \\). The limiting distribution exists and is equal to \\( \\pi \\). 
And furthermore, if \\( g \\) is any bounded function, then with probability 1: This really just means that if our Markov Chain has certain properties (basically just that any state can be gotten to from any other state at any time), then we can sample from this Markov Chain, and it’ll have the same distribution as a generic sample from our desired distribution. Random-Walk-Metropolis - Hastings: Let \\( f(x) \\) be the relative likelihood function of our desired distribution. And \\( q(y|x_i) \\) the known distribution easily sampled from (generally taken to be \\( N(x_i,b^2) \\) ) 1) Given \\( X_0,X_1,…,X_i \\), pick \\( Y \\sim q(y|X_i) \\) 2) Compute \\( r(X_i,Y) = \\min\\left(\\frac{f(Y)q(X_i|Y)}{f(X_i)q(Y|X_i)}, 1\\right) \\) 3) Pick \\( a \\sim U(0,1) \\) 4) Set The Confidence Builder: We would like to sample from and obtain a histogram of the Cauchy distribution: Note: Since \\( q \\) is symmetric, \\( q(x|y) = q(y|x) \\). Pseudocode: MC = [] X_i = initial_state for i in range(int(num_iters)): \tY = q(X_i) \tr = min(f(Y)/f(X_i), 1) \ta = np.random.uniform() \tif a < r: \t\tX_i = Y \tMC.append(X_i) Estimation: So you can clearly see that the different values of \\( b \\) lead to wildly different distributions, and therefore wildly different approximations of the end distribution. For example, if the standard deviation on our sampling distribution is very small, our prospective transitions won’t venture out very far, and hence the tails of our distribution will be woefully underrepresented. If, on the other hand, the standard deviation is too large, we will often venture our much further than we should, and the we’ll end up with something like a multi-modal distribution. Basically, the prospective transition states will often be too far out to accept, in which case the current state persists, or it will be just close enough to be accepted. 
What you end up with is far too many samples from the tails and not enough samples from the beafier parts of the distribution. Choosing the “best” value of \\( b \\) is more of an art than a science, really. This issue will poke its ugly head a little later. But even with that issue, we see clearly that something amazing is happening here. How is this happening? A property called detailed balance, which means, or in the continuous case: The intuition Since what we’re really after is the distribution of the samples and not the order in which they were picked, it’s reasonable to require a notion of (what I’m going to call) equal hopportunity: In other words, the probability of “hopping” from \\( x \\) to \\( y \\) should be the same as the reverse, because at the end of the day, both of these two scenarios only mean that \\( x \\) and \\( y \\) are in our sample. If you work out the math therein (as we will do below), you find that this ratio guarantees equal hopportunity. Let’s prove it! To reiterate, the claim was that this detailed balance property ( \\( f(x)P_{xy} = P_{yx}f(y) \\) ) guarantees that our likelihood function \\( f \\) is the stable distribution for our Markov Chain. Let \\( f \\) be the desired distribution (in our example, it was the Cauchy Distribution), and let \\( q(y|x) \\) be the distribution we draw from. First we’ll show that detailed balance implies \\( \\pi \\) (or \\( f \\) if the distribution is continuous) is the stable distribution! Let \\( \\pi_ip_{ij} = \\pi_jp_{ji} \\) for discrete or for continuous \\( f(i)P_{ij}=P_{ji}f(j) \\) Now we’ll show that the Markov Chain defined by the Metropolis-Hastings algorithm has the detailed balance property (and hence \\( f \\) is the stable distribution). Without any loss of generality, we’ll assume \\( f(x)q(y|x) > f(y)q(x|y) \\) (if this is not the case, the proof will follow the same with just symbolic changes). 
Note: Since \\( f(x)q(y|x) > f(y)q(x|y) \\), we know, \\( r(x,y) = \\frac{f(y)q(x|y)}{f(x)q(y|x)} \\), and \\( r(y,x) = 1 \\). Then, Modeling Change Point Models in Astrostatistics. This was taken from a Penn State statistics summer program website located at stat.psu.edu. A detailed walk-through of the process PSU took with this project can be found at that site. The data is a time series of light emission recorded and aggregated over periods of 10000 seconds (just under 3 hours). A plot of the time series can be seen below for your convenience. The idea (with this data) is that the emissions can be modeled by two Poisson distributions and a change point. The data will be modeled by the first Poisson until the change point, and then it switches to the second Poisson. So we start out with a multi-parameter model, and through some Bayesian inference you arrive at a pretty nasty looking posterior distribution. You’d like to sample this distribution to obtain, say, the mean \\( k \\) (change point) value along with a 95% confidence interval. Since you need that confidence interval, standard optimization methods won’t suffice: we’d need the distribution of \\( k \\)s. For this, we can use MCMC! The posterior distribution we would like to sample from is this: This distribution is pretty gnarly and since I don’t really know how to sample from this (beyond a blind search), we would like to use MCMC. Metropolis-Hastings (at least how we’ve discussed it) clearly doesn’t work here because this is multidimensional. For that we introduce another MCMC algorithm: Gibbs Sampling. Gibbs Sampling The basic pseudocode is this: X = initial_X Y = initial_Y ... Z = initial_Z MC = [(X,Y,...,Z)] for _ in range(num_iters): \t// for each variable we compute \tX ~ q_X(x|X,Y,...,Z) \t// We do this for X, Y, ..., Z in turn We can combine Gibbs with Metropolis quite easily: X = initial_X Y = initial_Y ... 
Z = initial_Z MC = [(X,Y,...,Z)] for _ in range(num_iters): \t// for each variable we compute \tprospect ~ q_X(x|X,Y,...,Z) \tr = min(f(prospect)/f(X), 1) \ta ~ uniform(0,1) \tif a < r: \t\tX = prospect \t// We do this for X, Y, ..., Z in turn So, basically, you update one at a time. Using this algorithm, we can analyze the data from the PSU website we mentioned above. In fact we did that very thing. Below you’ll find some graphs depicting the distribution of \\( k \\) values. Summary Random-Walk-Metropolis-Hastings Can be used to sample difficult univariate distributions relatively easily Have to tune the sampling parameter Curse of dimensionality in tuning parameters Requires \\( f \\) to be defined on all of \\( \\mathbb{R} \\) Transform as needed Gibbs Sampling Turn high dimensional sampling into iterative one-dimensional sampling Gibbs with Metropolis-Hastings Lovely Bibliography Summer School in Astrostatistics ",
"url": "/blog/2017/04/19/MCMC.html"
},
{
"title": "Decomposition of Autoregressive Models",
"excerpt": "We discuss the decomposition of AR models into components, and how eigenvalues are involved (because they always are).\n",
"content": "Autoregression Background This post will be a formal introduction into some of the theory of Autoregressive models. Specifically, we’ll tackle how to decompose an AR(p) model into a bunch of AR(1) models. We’ll discuss the interpretability of these models and howto therein. This post will develop the subject using what seems to be an atypical approach, but one that I find to be very elegant. The traditional way Let $x_t$ be an AR(p) process. So $x_t = \\sum\\limits_{i=1}^pa_ix_{t-i} + w_t$. We can express $x_t$ thusly: Where $L$ is the lag operator. We define the AR polynomial $\\Phi$ as $\\Phi(L) := \\left(1 - \\sum\\limits_{i=1}^pa_iL^i\\right)$. Then as long as we’re considering this polynomial to be over an algebraically closed field (like $\\mathbb{C}$), we can factor this polynomial. So Where $\\varphi_i$ are the roots of $\\Phi$. This polynomial will be the star of our show today. In fact, if we want to understand $\\Phi$ (the AR process we’re considering), we need only understand each $x_t = \\varphi_ix_{t-1} + w_t$ model. We’ll actually prove this later. We’ll be investigating several of its properties but first (just for fun) let’s investigate the conditions under which it’s invertible. We can reduce the problem of determining whether or not the AR polynomial is invertible to the problem of inverting each factor. After all, the polynomial is invertible if and only if each factor is. So let’s restrict our considerations to just a single factor for the time being. When will $x-\\lambda$ be invertible? Now, we recall the Taylor Series Expansion formula from our trusty calculus class. In general, we tend to use $a=0$ as a good arbitrary choice (the Maclaurin Series), resulting in: Well, if we consider the Taylor expansion of $\\frac{1}{x-\\lambda}$, we find Which is square summable if and only if $|\\lambda| > 1$. 
So we see that if the AR roots are all outside the unit circle, then the AR polynomial will be invertible (in such a case, the AR process is called causal). The new(?) approach Definition Let $X_t\\in \\mathbb{R}^n, A_i\\in\\mathbb{R}^{n\\times n}, B\\in\\mathbb{R}^{n\\times k_0}$, and $p\\in\\mathbb{N}_{>0}$. Also let $W_t$ be $k_0$ dimensional white noise. Then, $X_t$ is called a VAR(p) process if, Basically a VAR(p) model is an analogue of AR(p) models where the scalars are matrices and the variable itself is a vector. The noise in such a model need not be independent per coordinate, nor restricted to only one. In fact, we can have any number of driving noise processes that get mixed together linearly into $X_t$. Hence cometh $B$ – the linear transformation from the driving noise space into the $X_t$ space. Furthermore, let’s say that we don’t observe $X_t$ directly, but rather we observe some $Y_t$ that’s a linear function of $X_t$ (plus some noise). Formally, we observe $Y_t$ s.t.: Where $Y_t \\in\\mathbb{R}^m, C\\in\\mathbb{R}^{m\\times n}, D\\in \\mathbb{R}^{m\\times k_1}$. And again, $U_t$ is $k_1$ dimensional white noise. Here $D$ serves the same purpose as $B$ did above, but for the observations themselves. This is called a “State Space” model. Examplificate How does this relate to our star (the AR polynomial $\\Phi$)? Let’s try to express an AR(p) process as a VAR(1) process… Let $x_t$ be an AR(p) process. That is to say, $x_t = \\sum\\limits_{i=1}^p a_ix_{t-i} + w_t$ where $w_t$ is some white noise process. Let $W_t$ be a 1 dimensional white-noise process such that $W_t = w_t$. We define Then, we can see that the first coordinate of $X_t$ is always $x_t$. Our notation for this will be $X_t[0] = x_t$. Furthermore, let We can see that for $0 < i < p$, we have $(AX_{t-1})[i] = x_{t-i}$, and $(AX_{t-1})[0] = \\sum\\limits_{i=1}^p a_i x_{t-i} = x_t - w_t$. 
As it stands, $AX_{t-1}$ is almost $X_t$, just without the added noise in the first coordinate. So let’s add noise to only the first coordinate. To this end, let and Now we want $Y_t = x_t$. So we set That way, $Y_t = CX_t = x_t$ and all is right with the world. So yay! We’ve expressed this AR(p) process as a $p$ dimensional $VAR(1)$ process! So… Why? Why do? Well, one natural question to ask at this point is, what are the eigenvalues of $A$? This polynomial is the characteristic polynomial of $A$, so the roots of this polynomial are the eigenvalues of $A$. Let’s see how this relates to $\\Phi$… So the characteristic polynomial is really just $\\Phi(L)$ revisited! Furthermore, check this sweetness out… So we can see that the eigenvalues are actually intimately related with the roots (in fact they’re inverses of each other)! With that in mind, let’s see what happens when we try to diagonalize this matrix $A$. Firstly, let’s assume that $A$ matrix is diagonalizable. I.e., let where $H$ is a coordinate transformation (an invertible map) and $\\bar A$ is diagonal. Then So if we let $\\tilde X_t = HX_t$, we get So $\\tilde X_t$ is a bunch of AR(1) processes (because the matrix $\\bar A$ is diagonal), and, Which is to say that $Y_t$ is a linear combination of AR(1) processes! So we started out with a general AR(p) process, and found that expressing it as a VAR(1) model allows us to see that this is just a linear combination of AR(1) processes. I think that’s pretty cool. But let’s investigate this a bit further… What more?! As per our previous section, we have that $Y_t$, which was really just $x_t$ – our original AR(p) variable, is a linear combination of AR(1) models. More specifically, if we let $F := CH^{-1}$, then So indeed, $x_t$ is a linear combination of $p$ many AR(1) processes. 
In fact, since $\\tilde X_t = \\bar A \\tilde X_{t-1} + \\left(HB\\right)W_t$, we know the AR(1) models explicitly: $\\tilde X_t[i] = \\lambda_i\\tilde X_{t-1}[i] + \\left(HBW_t\\right)[i]$, where $\\lambda_i$ is the $i$th eigenvalue of $A$ (hence the $(i,i)$th element of $\\bar A$). So we see that the AR(1) processes each have their respective roots from the roots of the original AR polynomial $\\Phi$. Pretty cool, right? Summary So we can express any AR(p) model as a linear combination of AR(1) processes (albeit with correlated noise terms), where each AR(1) process is determined by the roots of the AR polynomial (or the characteristic polynomial). We’ve essentially reduced the problem of studying AR(p) models to studying their eigenvalues (or AR roots). ",
"url": "/blog/2017/10/25/Autoregression.html"
},
{
"title": "Let's Cover Homology (pt 1)",
"excerpt": "What does topology have to do with data?! And what is this topology thing anyway?\n",
"content": "Topology What is it? Topology is essentially the study of proximity, which is a close analogue for shape. The specifics of how that’s done are not really the point of this post. That’s all for this section. How is it related to analyzing data? Most data analysis techniques are already about quantifying some geometric attributes. For instance, when you fit a linear model, you’re really just imposing an affine model on the data, and computing the appropriate parameters (slope and intercept). But these sorts of things work best if you have an idea for what an appropriate shape would be. That’s one area where topology might be able to help. But to hear more about that, you’ll have to wait for my next post. There is a notion of equivalence (called isomorphism) of topological spaces (bi-continuous bijections). Basically, we consider two topological spaces to be equivalent if one can be continuously transformed into the other and back. That’s actually a very powerful description that captures a lot of non-linearities and things like that. For instance, an egg shape, a sphere, a pyramid, and a tetrahedron are all isomorphic topological spaces, but not a doughnut (torus). And those kinda make sense. But there are some counter-intuitive examples. For instance, if you were to take one of those long clown balloons and tie it up into a poodle, it would also be isomorphic to the bunch above. And, as the joke goes, donuts are isomorphic to coffee cups. I know what you’re thinking. This topology stuff sounds super cool. So topology would make a wonderful (admittedly incomplete) description of data. But there are some problems: mainly that this whole story breaks down in higher dimensions. Luckily, there is an analogue that can be computed. Homology What is it? Homology is (in an overly-simplified sentence), counting the number of holes. That’s actually a lot more information than you might initially think. 
It’s actually sufficient to completely classify the space in lower dimensions, and a good approximation in higher ones. Either way, if we knew that info about our data, we’d capture a lot of the non-linear relationships. And regardless, it’ll be fun, so let’s get into it. How to computes… Triangulation Simplices A simplex is a collection of $n$ points with the property that no one point lies in the (affine) subspace defined by the other $n-1$ points. The idea is that it’s the minimal number of points with an $n-1$ dimensional convex hull – or, said differently, the minimal number of points to make an $n-1$ dimensional shape. For instance, if we consider $\\mathbb{R}^2$ as our ambient space, then the only simplices that exist are 0, 1, or 2 simplices: the 0 simplices being points, 1 simplices being line segments and 2 simplices being triangles. If we look at these simplices like building blocks, we can now make something isomorphic to any shape just by glueing a bunch of these simplices together. But we’ll get to that in a bit. Let $\\{v_0,v_1,v_2\\}$ be a simplex. Firstly, since it’s a simplex with 3 points, it must be a 2-simplex. And since it’s a simplex, we know that $v_0-v_1\\notin span(\\{v_2-v_1\\})$ and the same relationship holds for all the rest of the points. This simplex is supposed to represent any 2 dimensional contiguous ball-like thing. For instance, since a filled-in circle can be continuously transformed into a filled-in triangle (which is, in essence what a 2-simplex is), our simplex becomes a proxy for such circles, and squares, and rhombi, and octagons, but not the Apple logo. But if you take that stem thing and throw it away so you’re only left with the main apple part of the logo, then yeah, that too. So now if we have any single contiguous $N$-D object without any holes, we can represent it topologically by a simplex (a triangle, in the 2-D case). So simplices make the building blocks for general shapes, as we’ll see in a bit. Simplicial Complexes So let’s build a simplicial complex! 
But wait! I haven’t even defined what that is yet! Meh, let’s just roll with it. So given a bunch of things we can represent by simplices (like the full apple logo, for instance), we can construct a simplicial complex to represent it. Some more examples would be: an air-filled balloon (which we’ll consider to be hollow) on a string, a filled water balloon (which we’ll not consider to be hollow) on a string, or a doughnut. The list goes on, obviously, but I think you get the point. We can represent pretty much anything this way. The specifics really only come in when we try to construct the simplicial complex itself. In order to make the thing have the nice characteristics we’ll use later on, we’ll need to make sure our complexes have some properties. But first, we need to know what the face of a simplex is: Given a simplex $\\{v_0,v_1,…,v_n\\}$, a face is the simplex defined by a subset (e.g. $\\{v_1,v_2,…,v_n\\}$ is a face, and so is the original simplex). If you think of a tetrahedron, for instance, the faces are the four triangular faces we’re used to calling faces as well as the six lines on the edges, and the four individual points. So the faces are kinda similar to our intuitive notion of a face. Now we can define a simplicial complex. A simplicial complex $K$ is a collection of simplices with the following properties (see mathworld): Each face of a simplex in $K$ is itself a simplex in $K$ The intersection of two simplices in $K$ is itself a simplex (hence in $K$) It’s the second rule that’s really the crux of everything, as we’ll see. But let’s go through some examples… The Apple logo can be triangulated by two disjoint 2-simplices (along with all the appropriate faces of the two). So we basically forced the first property to hold, and since the two simplices are disjoint the second property trivially holds. The filled water-balloon on a string can be triangulated by a 3-simplex attached to a 1-simplex along with all the appropriate faces. 
The filled water-balloon part is clearly the 3-simplex, and the string is the 1-simplex. So let’s prove that the two properties hold. The first one holds trivially, since we defined it as the union of all the faces. The second property holds because if you take any two simplices in this complex, either they intersect at the point on top (a 0-simplex) or they don’t intersect at all! (Pardon the hand waving proof there). The (hollow) air-filled balloon, on the other hand, is a bit harder to triangulate, but we’ll barrel through it. We can glue four 2-simplices together to be like a hollowed out 3-simplex, then attach a 1-simplex as the string to one of the vertices of the hollow tetrahedron, and we’ll be done. All that’s left is to show the two properties hold. That’s left as an exercise to the reader (ah, the good life). The doughnut, on the other hand, is a lot more complicated to triangulate. To do that we’d have to specify if we’re talking about the shell of a doughnut or a filled in one. Although this is a fun exercise, the traditional way to triangulate the torus doesn’t have much to do with what we’re going to be using this stuff for, so I’ll take a more data-science type approach to this. If we have a bunch of data arranged in the shape of a torus, we can just take the union of the simplices defined by taking a point and the 3 nearest points to it, and creating the convex hull of those points (or considering them as a 3-simplex). Associations But it would be useful to triangulate the torus using the more traditional approach. To get you started on the right track, you’ll need another tool: association. An association is basically a way of considering things to be equal. It’s kinda like defining specific teleportation or something. I think it’s best to just work out an example. So let’s triangulate some things (like a sphere – the air-filled balloon, for example) but without using the fact that we’re in 3-D space and using associations instead. 
Let’s start with two 2-simplices joined at one edge (to make something that looks like a square). We then make the following association: we say anything along the line $(\\langle1\\rangle,\\langle2\\rangle)$ is associated with things on the line $(\\langle4\\rangle,\\langle3\\rangle)$ – where order is important. If we want to make it more formal, we could parameterize the line connecting $\\langle1\\rangle$ to $\\langle2\\rangle$ by a parameter $t$ such that $f_0(0) = \\langle1\\rangle$ and $f_0(1) = \\langle2\\rangle$. We describe this visually like so: We’re considering all the points on the top line to be associated with (or equal to) their corresponding points on the bottom line. You can imagine this to be kinda like pacman: when you go off the top edge, you come back from the bottom edge in the same direction (because of the association that we’ve made). So, we might as well fold the paper over and tape the two sides together so as to better visualize the association we’ve made. But when we do that, we notice we have a tube! So the association shown above can be described as a tube. Unfortunately, it’s not a simplicial complex though. This can be seen because if we call the top-left triangle $A$, and the bottom-right one $B$, the intersection of $A$ and $B$ is the diagonal line $(\\langle1\\rangle,\\langle3\\rangle)$ and the associated line (it’s only one because the two are associated – hence the same line) $(\\langle1\\rangle,\\langle2\\rangle) = (\\langle4\\rangle,\\langle3\\rangle)$. So the intersection is two lines, which is a simplicial complex, but not a simplex! So this triangulation doesn’t have the second property that we require all simplicial complexes to have. We could make it a simplicial complex by dividing the thing into a 9 square grid (so 18 triangles) and then doing the same thing again with the top being associated with the bottom, but it’s a lot to write out. Try it out though. See if you can make it a simplicial complex. 
If you run into any issues, feel free to email me. Moving on… Now if we just flip the association, we get a Möbius strip! It might be useful to actually break out a piece of paper and do this if you’ve never done it before. If we take the tube and add this one association it becomes a torus (a hollow doughnut)! Notice that we identify which sides are connected by the fact that they share the same symbol (the double arrow is connected to the other double arrow, and the single arrow does likewise). If you’re trying it out on paper, make sure you label the points so you can see them! If you reverse the direction of one side of the second association, you get something very different – a Klein bottle! I feel like it’s probably a bit cruel of me to have put this after the picture, in case you’ve been trying to make the last one with your paper… It turns out, you can’t make the last one in 3D space. You need to either use an extra dimension or tear the paper, unfortunately. But, it’s interesting to see that flipping the association completely changes the thing – kinda like chirality in chemistry. Like if we take the Möbius strip and make the following additional association, you get something called the Projective Plane (\\(\\mathbb{P}^2\\) – also not embeddable in 3D space). And finally, the following association leads us to a sphere, or, more precisely, the surface of a sphere. Then, since all the points on the top line (and the bottom, left, and right lines) are all associated, imagine contracting them while keeping the inside of the bag relatively unchanged (as far as area is concerned). So necessarily the bag will puff up like a bubble as the sides squish down to a smaller and smaller hole. Eventually, the edge becomes just a point, and what do we have? Yup, we’ve got a sphere. So that’s one way we can triangulate a sphere. The problem is that none of these are simplicial complexes. Why not? Can you fix these so that they become simplicial complexes? 
Summary So we’ve triangulated some things, defined what simplices and simplicial complexes are, but we haven’t yet found their utility. For that, just stay tuned! In the next episode we’ll go over Betti Numbers and how to compute them using these simplicial complexes! ",
"url": "/blog/2018/04/12/homology.html"
},
{
"title": "2 ropes and a matchbook",
"excerpt": "You’re given two ropes that take 60 minutes to burn and a matchbook, and you need to measure 45 minutes.\n",
"content": "The scenario: You’re given 2 ropes and a matchbook. You know a priori that these two ropes each take an hour to burn, but they don’t burn at an even rate. For instance, the first rope might take 1 minute to burn through the first half and 59 minutes to burn the second, and the other rope might be completely different. Can you measure 45 minutes using only these two ropes and the matchbook? What you know: You have two ropes and a matchbook. The ropes take an hour to burn completely. The rate of burn is inconsistent. Hints (click to unblur) Can you measure 30 minutes? You may have to do two things at once A solution A description (and strategy) You can measure 60 minutes easily by just burning a rope. You can measure 30 minutes by burning both ends of a rope. So you can measure 45 minutes by lighting two ends of one rope and one end of the other, then when the first rope burns out, light the other end of the second rope. It’ll be 45 minutes because the first rope takes 30 minutes to burn, so there will be 30 minutes left on the second rope when you light the second end, thereby turning the 30 minutes of rope into 15 minutes. ",
"url": "/blog/2018/06/02/2_ropes_60_min.html"
},
{
"title": "The 7 person game-show riddle",
"excerpt": "You and 7 friends are assigned numbers from 1-7 and you have to guess your own\n",
"content": "The scenario: You and six of your friends are going on a game show. When you get there, you’ll be placed in a green room and told the rules of the game so that you all can strategize. Once you all agree on a strategy, you’ll be taken on stage where you’ll each be put in your own respective isolation chambers. Once your whole team is in their respective chambers, you will each be assigned a number between 1 and 7 (inclusive and possibly repeating). You’ll all be shown everyone else’s numbers, but not your own. Your whole team wins if any one member can guess his or her own number correctly. What you know: Each member of your team is assigned a number between 1 and 7 inclusive and possibly repeating. Each member gets only one guess. There’s no penalty for wrong guesses. There’s absolutely no communication between members while the game is on. You all win if at least one person from your team guesses correctly. Can you come up with a winning strategy? Hints (click to unblur) Start with a simpler case (like only 2 players, then 3 players, etc.) Consider what is conserved. Consider modular arithmetic A solution A description Regardless of who’s looking at the numbers, the sum of all the numbers (including your own) must be the same. The problem is that you don’t know what the missing number is (your own). But, if we make an assumption as to what the sum is, we could then solve for our own. But that leaves too many options (we could assume that the sum would be 7 or 49 or anything between). Luckily we don’t need all of that information. We really only need to assume something about what the sum is mod 7. For instance, if you knew that the sum of all the numbers was \\(3 \\pmod 7\\), and you see the numbers 1, 2, 3, 4, 5, 6, (the sum of which is \\(21 \\equiv 0 \\pmod 7\\)), you would know that your number is 3. But that assumes that you know what the sum is (mod 7). Luckily you have a total of 7 people on your team! 
The strategy In the green room you assign a number from 1 to 7 (inclusive and not repeating) to each person in your team. So each person has a number and all numbers from 1 to 7 are represented. Then you all go into your respective isolation chambers and get your numbers and solve for your own assuming the sum mod 7 is whatever number you were assigned. Since the sum must be something from 1 to 7 mod 7, we know one person had the correct assumption, and since given the assumption, the solution is unique, we know one person will guess his or her own number correctly. And we’re done! ",
"url": "/blog/2018/06/02/7_friends.html"
},
{
"title": "The blue-eyed reconing!",
"excerpt": "In a world of blue eyed and brown eyed people, it’s dangerous to know your own eye color.\n",
"content": "The scenario: There’s an island far away from anything we would call recognizable with a particularly strange custom: in this town, any person who discovers his or her own eye color is compelled to go to the town square at 9:00AM the following morning and kill him or her self publicly, so it goes. It’s a part of their lives and they all obey this custom diligently. This island is so secluded that the inhabitants had not seen an outsider until the one fateful day Pat stopped by. On that day, they were so excited to see an outsider that they asked Pat to give a speech in front of the whole town. Not knowing the town’s customs, in Pat’s opening remarks, Pat mentioned, “it’s nice to see other blue eyed people out here”. The question is: what happens? What you know: The town is very secluded and hasn’t seen any outsiders before The town has the custom that they kill themselves when they find out their own eye color, so it goes. A newcomer casually mentions in front of the whole town “it’s nice to see other blue eyed people here” What happens? Hints (click to unblur) Think inductively This is kinda a repeat (sorry for that) but start with the case where there’s only 1 blue eyed person, then go from there. A solution The Solution If there are \\(n\\) many blue eyed people in the town (assuming blue eyed people are the minority), then on the \\(n^\\text{th}\\) days, all the blue eyed people kill themselves, so it goes. The Explanation If there is only 1 blue eyed person, then (s)he looks around after the speaker leaves and says, “there are no blue eyed people here!”, then quickly realizes that (s)he must be the only blue eyed person and (s)he goes to the town square the next day and commits suicide, so it goes. If there are \\(n+1\\) many blue eyed people, then each looks around and sees only \\(n\\) and therefore concludes that \\(n\\) days from now all of those poor bastards are going to kill themselves, so it goes. 
But then the \\(n^\\text{th}\\) day comes and none of them commit suicide (because they each think they’re not blue eyed, but the other \\(n\\) are). Then they each come to the conclusion that there must be another blue-eyed person around, and since they don’t see one, they each conclude that it’s themselves. So on the \\((n+1)^\\text{th}\\) day, they all go to the town square, so it goes. ",
"url": "/blog/2018/06/02/blue_eyed_people.html"
},
{
"title": "Prisoners and The Lightblub",
"excerpt": "A group of prisoners encounter an evil warden and his sadistic lightbulb. Who will win?\n",
"content": "The scenario: One day, a drunk warden waltzes into her prison feeling particularly sadistic and decides to put on a little show for herself. She told her guards to gather 100 prisoners into a room so that she could talk to them. When in front of the group of prisoners, she tells them of her plan: All of their prison sentences have been extended indefinitely, but they have a chance to earn their freedom – through her game. She explained that she arranged for a room to be secured and left with just a single light bulb inside and that nobody but them would be allowed inside. They would then be brought into the room one prisoner per day and allowed to turn the light on or off. It’s guaranteed that nobody else would be given access to the room other than themselves, and that in no way would anyone mess with the light. They were told that at any time, anyone could come forward and say “I believe all prisoners has been in the room” if they believe it, and if it’s true, all the prisoners go free. If the prisoner says the phrase and it’s not true, they all are killed by firing squad. Now, you can’t just say the phrase on the 100th day – I mean, you could, but it probably wouldn’t work because the prisoners can be sent in multiple times. What you know: There are 100 prisoners There is a room with a light bulb that only these prisoners have access to They can really only communicate through this light bulb At any time, any one of them can stop the game, and if all prisoners have been in the room with the light, they go free. If not all prisoners have been in that room, then they all lose the game (or get killed, or the warden steps on their pet bunny or something – something negative happens). What’s your strategy? Hints (click to unblur) There’s a non-zero probability (assuming the choice of prisoner is random) that it’s 100,000 years before all prisoners go into that room. So don’t try to find a perfect solution. Just any solution will work. 
If you haven’t found any solutions, how about putting one prisoner in charge? Some Solutions Solution 1 You can put one prisoner in charge and only allow that prisoner to ever turn off the light. Then, whenever anyone else gets in the room, if the light is off and they’ve never turned it on before, turn it on; otherwise, just leave it be. The chosen prisoner ends the game when he or she turns the light off for the 99th time. That solution is okay, but not great. It gets the job done, but it’ll take forever. Solution 2 My favorite solution is kinda complicated, but it’s really cool… So, you pick a number, say, 10. Then, you make the following rules: You assign a notion of “value” to a light-bulb. For the first ten days, a light bulb is said to have a value of 1. Then every ten days after that, the value doubles, until it gets to 64 After 64, it goes back to 1. Everybody keeps a record in their own respective cells of the value they currently have. When a prisoner enters the room, if the light is on, the prisoner turns it off and adds the current day’s value to their own. Then, the prisoner writes out the value they have in binary. The value of the light bulb progresses to the next day’s value, let’s call it \\(\\lambda\\). If the prisoner has a “1” in the \\(\\lambda\\) place of the binary description of his or her own value, they turn on the light and decrease their own value by \\(\\lambda\\). So, for example, let’s pick 10 as our number of days per value thing. Then if today is the 9th day in the full cycle, then today’s value is 1 (and so is tomorrow’s). So if the light is on when I get in, I turn it off and increment my value. Let’s assume the light is on and I have a value of 20 (or 0010100, in our description). Then I turn off the light and find that I have a value of 21. Since tomorrow’s value is 1, and my binary description is 0010101, I return that point and turn on the light bulb, leaving me with 0010100. 
If on the other hand, it had been the tenth day and tomorrow’s value were 2, since I’d have a 0 in the 2’s place, I’d keep the point. The reason I find this solution so interesting is, it really gets to the core of money and value. In this case, the light is only twice as valuable on day 11 as it is on day 10 because we agree to artificially make it so, but as long as we’re doing that, the results are astounding and very real. Also, this is a completely decentralized solution! Come on, that’s amazing! ",
"url": "/blog/2018/06/02/prisoner_lightbulb.html"
},
{
"title": "The Monty Hall Problem",
"excerpt": "The infamous Monty Hall! You’re on a gameshow with three doors, two have goats and one has a brand new car! What will you win?!\n",
"content": "The scenario: You’re on a game show and they show you three doors (they’re all shut). They tell you behind two of these doors are goats, and behind one of them is a brand new car! They say you get to pick one door, then they’ll open up one of the remaining doors to reveal a goat, then ask you if you want to change your choice. What do you say? What you know: There are three doors Only one has the prize All doors have equal probability at the onset You choose one door The host shows a goat behind one of the other doors You’re asked if you want to change your choice to the last remaining door Hints (click to unblur) Just work out the probabilities Think of the microstates A solution A Strategy You do switch. The door you picked only has 1/3 probability of having the prize while the other door has 2/3 probability. A Description Let’s pretend like there are two different games: one where you pick one door and no chance to change your mind, and one where you pick two doors and if the car is behind either one you win. Obviously the first has a 1/3 probability and the second has a 2/3 probability of success. You can think of the strategy where you don’t switch as playing the first game and the one where you do switch as playing the second game. In the switching strategy, the first door you pick will never be the one you end with, so really, you’re not picking that door as much as you’re picking the other two and letting the host weed one out. At the end of the day, though, if you play with the switching strategy, then if the prize is behind either of the two doors you picked, you win! So it’s a 2/3 probability if you switch. ",
"url": "/blog/2018/06/03/monty_hall.html"
},
{
"title": "\\(n\\) people on a plane",
"excerpt": "In a world of seat stealing curmudgeons, will you get yours?!\n",
"content": "The scenario: \\(n\\) people are in line ready to board a full plane (so \\(n\\) people in \\(n\\) seats). The first person in line lost his boarding pass, so he just sits in any random old seat (but specifically not his own). From then on, the people boarding are so nice that they won’t confront someone sitting in their respective seats; instead, they’ll just take a random empty one. What’s the probability that the last person gets to sit in his or her own seat? What you know: There are \\(n\\) seats on the plane There are \\(n\\) passengers about to board the plane The first passenger sits in the wrong seat Every passenger after sits in his/her own seat if it’s available Every passenger sits in a random available seat if his/her seat is not available. You need to find out what the probability that the last person gets to sit in his/her own seat. Hints (click to unblur) It might be best to start with the base case (\\(n\\) = 3) What if the first person sat in a totally random seat (instead of specifically not his own)? A solution A description \\[ \\mathbb{P}\\left(\\cdot | n\\right) = \\frac{n-2}{2(n-1)} \\] The strategy Firstly, if \\(n = 1\\), the game is ill defined, if \\(n=2\\), the probability is 0 (because the first person always takes the wrong seat). If \\(n>2\\), then: There’s a \\(\\frac{1}{n-1}\\) chance that the first person takes the last person’s seat, thereby sealing the eternal aerial fate of the last boarder. So there’s a \\(\\frac{n-2}{n-1}\\) chance that the first jerk doesn’t take the last shlemazel’s seat. Every person after that either sits in their own seat (thereby not changing the odds for our lone hero of the skies), or sits in a random seat. When sitting in a random seat, they either: Sit in the first person’s seat thereby closing the book on this whole debacle, assuring the vacancy of the posterior protector for our helpless late-boarder. 
They sit in the last person’s seat, thereby ensuring their own place in hell along with our hero’s long lost comfort that only comes from sitting in your very own assigned seat. Or they sit in someone else’s seat, thereby passing the buck on to someone else and not changing the results at all. So, there’s a 50/50 shot (once we get past the first person, and assuming \\(n>2\\)). To summarize: there’s a \\(\\frac{n-2}{n-1}\\) probability that the last person’s seat is open after the first person takes a seat, then every turn after that has an equal probability of ending the game in a positive or a negative way. So the end formula is \\[ \\mathbb{P}\\left(\\cdot | n\\right) = \\frac{n-2}{n-1}\\cdot\\frac{1}{2} = \\frac{n-2}{2(n-1)} \\] The experiment I wrote a little python script that can be downloaded here. I essentially played this game 10,000 times for each \\(n\\) between 3 and 100, then plotted the mean for a particular \\(n\\) vs that \\(n\\), along with our theoretical value derived kinda loosely above. ",
"url": "/blog/2018/06/10/n_people_on_a_plane.html"
},
{
"title": "About the riddles category",
"excerpt": "In a world filled with lame riddles, where can you find good ones?!\n",
"content": "Far too many riddles are exceedingly lame. Scenarios like “John is twice as old as Julia, and 5 years older than Tim, how old is Sally?” are really just super lame and uninteresting. They’re work, not play. It’s not even that they’re difficult; it’s just that they’re annoying. You just write it down and it’s very basic linear algebra. Every time. So for that reason, you will not find any of these atrocities on this site. This site will be exclusively for riddles I find to be fun. I also won’t be posting riddles that start out with a lie as the premise. So there are no tricks nor gimmicks to the riddles I’ll be posting here. And finally, I won’t be posting riddles like “what stands on four legs in the morning, two in the afternoon, and three in the evening?” Not for any real reason; I just don’t want to. That’s all until one day I do. It’s my site, I do what I want! If you have any riddles you think are cool, please send them to me! I love riddles. ",
"url": "/blog/2018/06/24/about-riddles.html"
},
{
"title": "Boys in the hood",
"excerpt": "What’s the probability the other one is a boy?\n",
"content": "This post is really several riddles that get progressively more difficult. So don’t be dissuaded if you find the first few too easy (or if you don’t). Baby crazy Let’s say we know a person named Pat who has two kids and at least one is a boy. Assuming that across the population the probability of having a boy is 50%. What is the probability that the other kid is a boy as well? It’s 1/3 (or $33.\\bar3$%). To see this, let’s just work out the possible states (the following states will be written as younger,older): B,B B,G G,B G,G But since we know G,G is not a possibility (because Pat has at least one boy), and only one of the other three have the other child being a boy. This might be easier to see and believe if you consider this using coin flips. To make things a little more difficult (as babies often do), What’s the probability for an arbitrary probability $p$ of having a boy? Solution We have the same states as before, but let’s first define some things to help us out: \\[ \\begin{align} \tBG=& \\mathbb{P}(\\text{boy})\\cdot \\mathbb{P}(\\text{girl}) & GB=& \\mathbb{P}(\\text{girl})\\cdot \\mathbb{P}(\\text{boy}) & BB=&\\mathbb{P}(\\text{boy})\\cdot \\mathbb{P}(\\text{boy}) \\\\ \t=& p(1-p) & =& (1-p)p & =& p^2 \\end{align} \\] now the probability is: \\[ \\begin{align} \t\\frac{BB}{BG + GB + BB} =& \\frac{p^2}{p(1-p) + (1-p)p + p^2} \\\\ \t=&\\frac{p}{2(1-p)+p} = \\frac{p}{2-p} \\end{align} \\] In particular, if $p=0.5$, we get \\[ \\begin{align} \t\\frac{p}{2(1-p)+p} = \\frac{p}{2-p} =& \\frac{0.5}{2-0.5} \\\\ \t\\frac{0.5}{1.5} = \\frac{1}{3} \\end{align} \\] And just because I apparently like(?) making graphs… In case you’re eager to see the like 5 lines of code that went into making that graph, the notebook can be downloaded here. What if we know that the older child is a boy? The whole paradoxical part of the problem goes away once we know the position of the known sex baby. 
By that I mean that the only possible states are: G,B B,B So it’s 50/50. What if I told you that Pat has 2 kids, and one is a boy born in summer? This is where it gets a little tricky. The states are now far too many to list. But, basically, it’s \\[ B\\cdot{1 \\text{ seasons}}, G\\cdot{4 \\text{ seasons}} \\\\ G\\cdot{4 \\text{ seasons}}, B\\cdot{1 \\text{ seasons}} \\\\ B\\cdot{3 \\text{ seasons}}, B\\cdot{1 \\text{ seasons}} \\\\ B\\cdot{1 \\text{ seasons}}, B\\cdot{4 \\text{ seasons}} \\] Note that we’re not double counting the case where both are boys born in summer. So, assuming each season is equally likely, there are a total of 15 equally likely outcomes, and 7 of them have two boys. So it’s \\(\\frac{7}{15}\\). What if we only know that one child is a boy born on December 23rd? Or we could consider something a bit more general, where the extra condition (born in summer, born on December 23rd, and so on) holds with probability \\(q\\). Solution We’re still assuming the probability of having a boy is $p$. Then we can describe the same table as above like so: \\[ \\begin{align} \t\\mathbb{P}(B\\cdot{1 \\text{ seasons}}, G\\cdot{4 \\text{ seasons}}) =& pq(1-p) & =& pq(1-p) \\\\ \t\\mathbb{P}(G\\cdot{4 \\text{ seasons}}, B\\cdot{1 \\text{ seasons}}) =& (1-p)pq & =& pq(1-p) \\\\ \t\\mathbb{P}(B\\cdot{3 \\text{ seasons}}, B\\cdot{1 \\text{ seasons}}) =& p(1-q)pq & =& p^2q(1-q) \\\\ \t\\mathbb{P}(B\\cdot{1 \\text{ seasons}}, B\\cdot{4 \\text{ seasons}}) =& pqp & =& p^2q \\end{align} \\] So we have: \\[ \\begin{align} \t\\mathbb{P} =& \\frac{p^2q(1-q) + p^2q}{p^2q+p^2q(1-q) + 2pq(1-p)} \\\\ \t=& \\frac{pq\\left(p(1-q) + p\\right)}{pq\\left(p+p(1-q)+2(1-p)\\right)} \\\\ \t=& \\frac{p(1-q) + p}{p+p(1-q)+2(1-p)} \\\\ \t=& \\frac{p(2-q)}{2-pq} \\end{align} \\] And we can see, as \\(q\\to0\\), \\(\\mathbb{P}\\to p\\), and as \\(q\\to1\\), \\(\\mathbb{P}\\to \\frac{p}{2-p}\\). The closest thing I can give toward intuition here is this: let’s take the boy-in-summer example, and assume there’s a 25% chance of being born in summer. 
Then the summer condition restricts the known child to 1/4 of the possibilities it would otherwise have, but there is no such restriction on the other child. So, in a sense, it down-weights the effect of the known child (by scaling up all of the other possibilities). ",
"url": "/blog/2018/06/24/boys-in-the-hood.html"
},
{
"title": "Getting across the river",
"excerpt": "The river-crosser’s union problem set #1\n",
"content": "Part 1 Foxes, Rabbits, and Lechuga You stand at the edge of a riverbed with your pet fox, a pet rabbit, and a head of lettuce. You need to cross the river, but can only take one of the three with you each time you cross. If the fox is ever left unsupervised with the rabbit (s)he’ll eat it, and the same for the rabbit and the head of lettuce. Just to clarify what you know: You have a fox, rabbit, and a head of lettuce The fox will eat the rabbit, and the rabbit will eat the lettuce (if given the opportunity) You can only take one across at a time How can you get all three to the other side of the river without any of them getting eaten? Solution: First let’s define the sides as $A$ and $B$ (we’re on side $A$). Take the rabbit to side $B$ (the fox is left with the lettuce on side $A$, but that’s okay) Go back to side $A$ without anything in the boat. Take the fox to side $B$ Go back to $A$ with the rabbit Take the lettuce to side $B$ Go back alone Take the rabbit to side $B$ Profit heavily Part 2 Pesky children and over-protective parents This riddle is courtesy of Tazo Lezhava Now 5 cannibalistic parents stand on the edge of a riverbed each with their first-born child. These parents are very over-protective and won’t let their kids be with any of the other parents if they’re not there as well. There is a boat that can take 3 people at a time. How do they use that boat to get all 10 people (the 5 cannibalistic parents and their first-borns) to the other side of the river without any kids getting eaten? Solution: Solution courtesy of Andrew Robinson Again, let’s call them sides $A$ and $B$, and we’re on $A$. Let’s also call the children be denoted by the numbers $1,2,3,4,5$, and the parents by the letters $a,b,c,d,e$, where $a$ is $1$’s parent and so on. 
move side $A$ side $B$ $1,2,3,4,5,a,b,c,d,e$ $\\emptyset$ $\\rightarrow 1,2,3$ $4,5,a,b,c,d,e$ $1,2,3$ $\\leftarrow 3$ $3,4,5,a,b,c,d,e$ $1,2$ $\\rightarrow 3,4$ $5,a,b,c,d,e$ $1,2,3,4$ $\\leftarrow 4$ $4,5,a,b,c,d,e$ $1,2,3$ $\\rightarrow a,b,c$ $4,5,d,e$ $1,2,3,a,b,c$ $\\leftarrow 3,c$ $3,4,5,c,d,e$ $1,2,a,b$ $\\rightarrow c,d,e$ $3,4,5$ $1,2,a,b,c,d,e$ $\\leftarrow 2$ $2,3,4,5$ $1,a,b,c,d,e$ $\\rightarrow 2,3,4$ $5$ $1,2,3,4,a,b,c,d,e$ $\\leftarrow 4$ $4,5$ $1,2,3,a,b,c,d,e$ $\\rightarrow 4,5$ $\\emptyset$ $1,2,3,4,5,a,b,c,d,e$ And we’re done. This is a pretty confusing way to show it, but with that many moving pieces, there aren’t too many great ways to visualize this. At least none that I know of. ",
"url": "/blog/2018/06/26/hares-and-rivers.html"
},
{
"title": "Let's talk Homotopy and Algebra",
"excerpt": "Just a little dip into homotopy via the Fundamental Theorem of Algebra\n",
"content": "So this post is going to be a bit more terse than most. In fact, the objective of this post will be to develop the theory of Homotopy very briefly, with the goal of proving the Fundamental Theorem of Algebra. Homotopy Given two continuous maps \\(f,g : \\mathbb{R}^n\\to\\mathbb{R}m\\), a homotopy between \\(f,\\) and $g$ is a continuous map $h: \\mathbb{I}\\times\\mathbb{R}^n\\to\\mathbb{R}^m$ such that $h(0,x)=f(x)$ and $h(1,x) = g(x)$. Basically it’s a continuous deformation of one map into the other. It’s a bit stronger than just homeomorphism of topological spaces. You can see this because two interlocked rings are topologically homeomorphic to non-interlocked rings, but they’re not homotopic because you’d have to split one of the rings. Enough of all that! Let’s get to the proof already! The Fundamental Theorem of Algebra The statement If $p\\in\\mathbb{C}[x]$ is a non-constant polynomial, then it has at least one root in $\\mathbb{C}$. A consequence of this theorem is that any non-constant polynomial over $\\mathbb{C}$ has all of its roots in $\\mathbb{C}$. Or equivalently, any polynomial of degree $n$ over $\\mathbb{C}$ has $n$ roots in $\\mathbb{C}$. The proof Let $p(x) = \\sum\\limits_{i=1}^n a_i x^i \\in\\mathbb{C}[x]$ be a polynomial without any roots in $\\mathbb{C}$. For simplicity, we assume that the leading coefficient is 1. There is no loss of generality in doing so. We then define a function Then for each $r$ in $\\mathbb{C}$, $f_r:\\mathbb{C}\\to S^1$ is a map from $\\mathbb{C}$ to $S^1$. Furthermore, given $r_0,r_1\\in\\mathbb{C}$, we can define a homotopy from $f_{r_0}$ to $f_{r_1}$ as $g(t,s) = f_{r_0+t(r_1-r_0)}(s)$. So for all $r\\in\\mathbb{C}$, $f_r$ is homotopic to $f_0$. Since $f_0(s) = 1$ for all $s\\in\\mathbb{C}$, we have that $f_r$ are all in the same homotopy class as the constant function on $S^1$. We now pick $z,r$ such that $|z| = r \\gt \\max(1,\\sum |a_i| )$. 
We then note that the triangle inequality gives us the following: \\[ |z^n| = r\\cdot r^{n-1} \\gt \\left(\\sum\\limits_{i=0}^{n-1} |a_i|\\right)|z|^{n-1} \\ge \\left|\\sum\\limits_{i=0}^{n-1} a_i z^i\\right| \\] Which means $z^n + t\\sum\\limits_{i=0}^{n-1} a_i z^i \\neq 0$ whenever $|z| = r$ and $0\\le t\\le 1$. So, we can define yet another function $q(t,z) = z^n +t(\\sum\\limits_{i=0}^{n-1}a_iz^i)$. Then $q$ is a homotopy between $z^n$ and $p(z)$. If we make a similar construction to $f_r$ but with this parameterized polynomial $q(t,\\cdot)$ in place of $p$, we get a homotopy between $f_r$ (at $t=1$) and the loop that just goes around the circle $n$ times (at $t=0$). But since $f_r$ is homotopic to a point on $S^1$, and going around the circle any non-zero number of times is not homotopic to a point, we know that $n=0$. So the only polynomials without roots in $\\mathbb{C}$ are constant. Kinda cool, huh? This proof was courtesy of Algebraic Topology by Allen Hatcher. ",
"url": "/blog/2018/06/29/homotopy.html"
},
{
"title": "Welcome to the site of a MathemaNiskin!",
"excerpt": "A boilerplate for my mind. Here is where I will blog about whatever: Math, Statistics, Economics, Computer Science, Algorithms, etc.\n",
"content": "Welcome to my home on the internet. I’ll be posting sporadically about whatever interests me. Generally that’ll be more STEM oriented, but I imagine once I get more used to this whole blogging thing, I’ll branch out. For now the posts are mainly about Math, Algorithms, Optimization, Statistics, Sampling, etc. Enjoy the site and please feel free to email me, if you come across any issues, typos, mistakes, etc. or if you just feel like talking with me. I’m always up to chatting with new people. If you’re interested in tutoring, visit Math-Busters. If you’re in the San Francisco Bay area, and would like in-person tutoring, on the other hand, feel free to email me for a recommendation. ",
"url": "/"
},
{
"title": "Post Categories",
"excerpt": "Post Category index\n",
"content": " ",
"url": "/posts_categories.html"
},
{
"title": "Search",
"excerpt": "Search for a page or post you’re looking for\n",
"content": " ",
"url": "/search.html"
},
{
"title": "Aaron Niskin's Blog - page 2",
"excerpt": "\n",
"content": " ",
"url": "/blog/2/"
},
{
"title": "Aaron Niskin's Blog - page 3",
"excerpt": "\n",
"content": " ",
"url": "/blog/3/"
},
{
"title": "Aaron Niskin's Blog - page 4",
"excerpt": "\n",
"content": " ",
"url": "/blog/4/"
},
{
"title": "Aaron Niskin's Blog - page 5",
"excerpt": "\n",
"content": " ",
"url": "/blog/5/"
}
]