-
Another annotated transformer
If you’re anything like me, while trying to implement transformers, you’ve read the original “Attention Is All You Need” paper, the annotated transformer, the updated version, d2l.ai, and had to cobble them all together to get something going. This post is an attempt to make that process easier for people like me, in a short and to-the-point style. You can think of this as a bare-bones implementation with a whole lot of documentation. Notes: this post is about how transformers are implemented, not why they’re implemented the way they are. This post assumes the reader understands what is meant by training a model, parameters and hyperparameters, dense layers, and so on. We’ll be using jax and flax for this implementation. Overview The following is the transformer architecture di...
-
Intro To Algebraic Geometry, 1.2
Intro In the last post we covered the definition of places and valuations (sometimes called valuation rings). We covered how there is a one-to-one correspondence between valuations and places (up to equivalence). That was the entirety of chapter 1 section 1. This post will be about chapter 1 section 2. Valuations Group order Let $\Gamma$ be a multiplicative commutative group. If there exists a multiplicatively closed subset $S\subset \Gamma$ that does not contain the group’s $1$ element and has the property that for each $x$ in $\Gamma$ either $x\in S$ or $x^{-1}\in S$1, we can define a strict total order on $\Gamma$ as: \[a \lt b \iff ab^{-1}\in S\] Just to refresh everyone, a total order is a relation with 4 properties. The “total” part (fully connected): for all $a, b \in ...
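A concrete instance of this construction (my own illustration, not from Lang): take $\Gamma = \mathbb{Q}_{>0}$ under multiplication and $S = \{x \in \Gamma : x \lt 1\}$. Then $S$ is multiplicatively closed, $1 \notin S$, and for every $x \neq 1$ exactly one of $x$, $x^{-1}$ lies in $S$, so the induced order

\[a \lt b \iff ab^{-1} \in S \iff \frac{a}{b} \lt 1\]

recovers the usual order on the positive rationals.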
-
Intro to Algebraic Geometry, 1.1
Intro I haven’t really studied any math seriously since I graduated with my undergrad in 2015. It seems I kinda miss it, so I’ll be going over Introduction to Algebraic Geometry by Serge Lang today. It starts out with the theory of Places. The idea being that the kernel of a homomorphism between fields is necessarily trivial, i.e., field homomorphisms are injective. Before we prove that, we define a field homomorphism as a function $\varphi:F\to K$ such that: $\forall a,b \in F, \varphi(a + b) = \varphi(a) + \varphi(b)$ $\forall a,b \in F, \varphi(a \cdot b) = \varphi(a) \cdot \varphi(b)$ $\varphi(0) = 0$, and $\varphi(1) = 1$ Axiom 3 only guarantees that the homomorphism doesn’t send every element to 0. Field homomorphisms are injective Computational proof It’s been a long time since I’ve done any math, so let’s warm up a bit w...
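For a preview of why injectivity holds (the standard argument, sketched here in my own words): if $\varphi(a) = 0$ for some $a \neq 0$, then multiplicativity together with $\varphi(1) = 1$ gives

\[1 = \varphi(1) = \varphi(a \cdot a^{-1}) = \varphi(a)\cdot\varphi(a^{-1}) = 0 \cdot \varphi(a^{-1}) = 0,\]

a contradiction. So the kernel is trivial and $\varphi$ is injective.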
-
A mile south, east, and north
The scenario You’re standing on the surface of the earth (as opposed to whatever your usual hangouts are), and walk a mile south, a mile east, and then a mile north and end up where you started. You see a bear. What color is the bear? You’ve heard this one before, but just in case, the answer is: White. You are at the north pole and it was a polar bear. But that’s not the only point that it could be. The riddle is, can you describe the full set of solutions? Solution Besides the north pole: pick any circle of latitude near the south pole whose circumference is $1/n$ miles for some positive integer $n$, then start a mile north of it and you have a solution (the mile east just loops that circle $n$ times). As $n$ grows, these starting circles accumulate at the ring lying one mile north of the south pole.
-
The sequential shooters
There are $N$ people indexed by $1,\dots,N$ (because down with the zero-indexers! Long live Peano!) standing in a circle. They’re standing such that $i$ is between $i-1$ and $i+1$ (unless $i \in \{1,N\}$, for obvious reasons). Then starting with $1$ and going around the circle in turn, each person shoots the person directly to their left. This stops when there’s only one person left alive. For example: if $N=5$, it would go: 1 shoots 2 3 shoots 4 5 shoots 1 3 shoots 5 and only 3 remains. The task Given $N$, what is the index (1-indexed, unfortunately) of the last shooter standing? For example: if $N=5$, the answer is $3$. Solution: The Theoretical Solution Every number $N$ can be written uniquely as $2^i + \lambda$, where $i,\lambda\in\mathbb{N}_{\geq 0}$ and $\lambda \lt...
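A quick way to check the decomposition-based answer against brute force (a sketch; the function names are mine, not the post's):

```python
def survivor(n: int) -> int:
    """Brute-force simulation: index of the last shooter standing."""
    people = list(range(1, n + 1))
    i = 0  # position of the current shooter
    while len(people) > 1:
        v = (i + 1) % len(people)  # the person directly to the shooter's left
        people.pop(v)
        i = v % len(people)        # the gun passes to the next survivor
    return people[0]

def survivor_formula(n: int) -> int:
    """Closed form: write n = 2**i + lam with lam < 2**i; the answer is 2*lam + 1."""
    lam = n - (1 << (n.bit_length() - 1))
    return 2 * lam + 1
```

`survivor(5)` returns 3, matching the worked example above.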
-
The Sequential Doors and The Data Science Analogue
This post is going to be something of a combination riddle and blog post. The setup There are $N$ doors indexed by $1,\dots,N$ along a wall and they’re all closed. Then for each $i\in\{1,2,\dots,N\}$, you toggle every $i$th door. So at first ($i = 1$) you toggle every door. Then $i=2$ you toggle every second door – the even doors ($2, 4, 6, \dots$), and so on. For example, if $N=4$, then the following is the sequence of events: Open every door. Close doors 2 and 4. Close door 3. Open door 4. This results in doors 1 and 4 being open, while 2 and 3 are closed. The task Given $N$, describe the “door is closed” function. By that I mean, which doors are open and which are closed at the end? Solution Only the perfect squares are open at the end. So $1,4,9,16,\dots$ are open, and $2,3,5,...
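The toggling process above can be simulated in a few lines (my own sketch, not the post's code): door $d$ is toggled once per divisor of $d$, so it ends open iff $d$ has an odd number of divisors, i.e. iff $d$ is a perfect square.

```python
def open_doors(n: int) -> list[int]:
    """Simulate the toggling and return the doors left open."""
    doors = [False] * (n + 1)  # doors[0] unused; False = closed
    for i in range(1, n + 1):
        for d in range(i, n + 1, i):  # every i-th door
            doors[d] = not doors[d]
    return [d for d in range(1, n + 1) if doors[d]]
```

`open_doors(16)` returns `[1, 4, 9, 16]`, the perfect squares up to 16.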
-
Quantile Regression Yields Quantiles
Background If you’ve ever read any papers referring to quantile regression (QR), you undoubtedly have read “MSE yields means, MAE yields medians, and QR yields quantiles”, but what does that even mean? Like not just hand-wavy nonsense, but really, what does that mean? After all, these are just loss functions and therefore don’t yield anything at all. Even the regressions themselves yield coefficients, or, more generally, models; so what’s with the mean and median talk? It turns out they’re talking about the distribution of residuals1. In this post, we’ll show what they mean, how that’s the case, and we’ll investigate a bit more about QR. Quantile Regression Definition Given a scalar $q\in[0,1]$, let $l_q:\mathbb{R}\to\mathbb{R}_+$ be: \[\begin{align*} l_q(r) =& r\cdot\left(q - \mathbb{I...
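The loss being defined here is the standard pinball (quantile) loss, $l_q(r) = r\,(q - \mathbb{I}[r \lt 0])$. A minimal numpy sketch (variable names are mine), plus a check that minimizing its mean over constant predictions lands on the empirical quantile:

```python
import numpy as np

def pinball_loss(r, q):
    """Standard pinball / quantile loss: r * (q - 1[r < 0])."""
    r = np.asarray(r, dtype=float)
    return r * (q - (r < 0))

# Minimizing the mean pinball loss over constants recovers the
# empirical q-th quantile (up to grid resolution).
rng = np.random.default_rng(0)
ys = rng.normal(size=10_000)
grid = np.linspace(-3, 3, 601)
best = grid[int(np.argmin([pinball_loss(ys - c, 0.9).mean() for c in grid]))]
```

Here `best` sits within grid resolution of `np.quantile(ys, 0.9)`, which is the “QR yields quantiles” claim in miniature.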
-
Gradient Boosting Seen as Gradient Descent
tldr The short version of this whole thing is that we can see the gradient boosting algorithm as just gradient descent on black-box models. Typically, gradient descent involves parameters and a closed-form expression for the gradient of your learner as a way to update those parameters. In the case of trees, for instance, there are no parameters you want to update1, and somewhat more importantly they aren’t even continuous predictors, much less differentiable! But gradient boosting can be seen as gradient descent on black box models resulting in an additive black box model. GBM The naive algorithm A description of the algorithm I’m talking about can be found on wikipedia, but I’ll go over the algorithm somewhat quickly here just so we’re on the same page. The basic idea is that Gradien...
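As a toy illustration of “gradient descent on black-box models” (a sketch with made-up helper names, not the post's code): for squared loss the negative gradient with respect to the current predictions is just the residual vector, so each round fits a weak learner, here a one-split stump, to the residuals and takes a small step in function space.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump on 1-D inputs x for targets r."""
    best_sse, best_split = np.inf, None
    for t in np.unique(x)[:-1]:  # the max value would leave the right side empty
        left, right = r[x <= t].mean(), r[x > t].mean()
        sse = ((r - np.where(x <= t, left, right)) ** 2).sum()
        if sse < best_sse:
            best_sse, best_split = sse, (t, left, right)
    t, lo, hi = best_split
    return lambda z, t=t, lo=lo, hi=hi: np.where(z <= t, lo, hi)

def boost(x, y, n_rounds=50, lr=0.1):
    """Gradient descent in function space, with stumps approximating the gradient."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        residuals = y - pred         # negative gradient of 0.5 * (y - pred)**2
        h = fit_stump(x, residuals)  # fit a black-box learner to the gradient
        pred = pred + lr * h(x)      # take a step in function space
    return pred
```

The sum of the fitted stumps (times the learning rate) is exactly the “additive black box model” the tldr describes.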
-
A little geometric walk down risk model lane
Background Earlier today I asked a colleague what the formal definition of a Unit Factor Portfolio is. He told me “it’s a portfolio with unit exposure to the factor, and orthogonalized to all the others.” That didn’t make sense to me, so I’m going to dig in here. We (quants) tend to use some form of a risk model to optimize our portfolios1 given a vector of alphas2. This post assumes that you know what a risk model is. Today we’ll be going over risk models from a geometric point of view. Mathematical Background Random Variables A continuous, real-valued random variable3 $X$ can be seen as a vector in $\mathbb{R}^n$, where $n$ is the number of observations you have. So if I take 7 samples of a normally distributed random variable, it might look like this: >>> import numpy...
-
Let's Cover The Fundamental Group (pt 1)
Jumping right in Prerequisites: Firstly, this post builds on my previous post, so if you’re learning these things for the first time and find yourself kinda lost, start there. In addition to the previous post, the following are mathematical definitions you should probably know before reading this post. If you don’t know any of these, trust Google. Group, Homomorphism, Kernel, Subgroup, Normal subgroup, Homotopy, Retraction, Star Convex. Moving onward Definition: Given a path $\alpha$ from $x_0$ to $x_1$ in $X$, we define $\hat\alpha:\pi_1(X,x_0)\to\pi_1(X,x_1)$ as follows: \[\hat\alpha([f]) = [\bar\alpha]\star[f]\star[\alpha]\] And it’s a theorem that $\hat\alpha$ is a group isomorphism. Definition: Given a continuous map $h:(X,x_0)\to(Y,y_0)$, we define: $h_\...
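One step worth making explicit (the standard argument, phrased in my own words): the inverse of $\hat\alpha$ is $\hat{\bar\alpha}$, the map induced by the reverse path, since

\[\hat{\bar\alpha}\left(\hat\alpha([f])\right) = [\alpha]\star[\bar\alpha]\star[f]\star[\alpha]\star[\bar\alpha] = [f],\]

because $[\alpha]\star[\bar\alpha]$ is the identity class at $x_0$ (and similarly at $x_1$ for the other composite), which is most of what makes $\hat\alpha$ an isomorphism.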