Notes:
The following is the transformer architecture diagram taken from the original paper. We’ll be referring back to it often in this post.
At a high level, the transformer is an encoder-decoder model; it takes a sequence of tokens from a source (e.g. English words) and learns to translate that into a destination sequence (e.g. French words). There are three flavors of transformers: encoder/decoder, encoder-only, and decoder-only. This post will focus on encoder/decoder transformers.
from typing import Callable, Sequence
import chex
import jax.numpy as jnp
import jax.random as jran
import jax.tree_util
import optax
from flax import linen as nn
from IPython.display import display
from tqdm.auto import tqdm
key = jran.PRNGKey(0)
As we said before, the transformer model is a sequence to sequence model. Natural language lends itself to many possible sequence definitions (words, characters, bigrams, etc.), so strictly speaking we need to define a tokenizer before we even get to the Embeddings.
The tokenizer takes something like natural language and returns a sequence (typically of unique natural numbers – IDs). We’ll tokenize each character "a" through "z" along with the requisite start and pad tokens (represented by "<start>" and "<pad>" respectively – to be explained later). So our tokens are ["a", "b", "c", ..., "z", "<start>", "<pad>"], and our tokenizer maps those to [0, 1, 2, ..., 25, 26, 27].
vocab = {chr(97 + i): i for i in range(26)}
vocab['<start>'] = len(vocab)
vocab['<pad>'] = len(vocab)
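As a quick sanity check that the mapping matches the list above (this little snippet is purely illustrative):
print(vocab['a'], vocab['z'], vocab['<start>'], vocab['<pad>'])
# 0 25 26 27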
One reason transformers really took off (in their early days) was that you could easily train them in batches without the recurrence required by something like RNNs. In order to do that, we need to arrange our sequences into a batch. But what if the sequences are of different lengths? For that we use jax.numpy’s pad function: we fill in the empty spots at the end of the shorter sequences with our '<pad>' token so that we can ignore them in our loss.
def str2ids(txt, vocab=vocab):
return jnp.array([vocab[x] for x in txt])
def strs2ids(*txts, vocab=vocab):
ids = [str2ids(x, vocab=vocab) for x in txts]
maxlen = max([len(x) for x in ids])
return jnp.stack([jnp.pad(jnp.array(x), pad_width=(0, maxlen - len(x)),
mode='constant', constant_values=vocab['<pad>'])
for x in ids])
def ids2str(ids, vocab=vocab):
x = [list(vocab)[x] for x in ids]
x = [y if y != '<pad>' else '~' for y in x]
return ''.join(x).rstrip('~')
def ids2strs(ids, vocab=vocab):
return [ids2str(x, vocab=vocab) for x in ids]
seq = ['hey', 'there', 'ma', 'dood']
assert ids2strs(strs2ids(*seq)) == seq
display(strs2ids(*seq))
del seq
Array([[ 7, 4, 24, 27, 27],
[19, 7, 4, 17, 4],
[12, 0, 27, 27, 27],
[ 3, 14, 14, 3, 27]], dtype=int32)
Notice how 27 appears in that matrix – 27 is our "<pad>" token.
After tokenization, the Input Embedding is the start of the data flow. An embedding is a mapping from a discrete set with cardinality $N$ to a subset of $\mathbb{R}^d$ where $d\ll N$. This is generally done in such a way that the topology is preserved. It can be broken down into two parts: a (learned) mapping from integer IDs to vectors in $\mathbb{R}^d$, and a mapping from tokens to those integer IDs.
The second part comes naturally from our vocabulary definition (we can map each token to the integer the vocabulary maps it to).
The first part is already implemented in flax. We instantiate an example nn.Embed layer below with $N=28$ and $d=2$ (using the notation above). You can see that the embedding layer only has one parameter, 'embedding', which is an $N\times d$ matrix. $d$ is called the embedding dimension.
model = nn.Embed(len(vocab), 2)
params = model.init(key, jnp.array([1]))
jax.tree_util.tree_map(jnp.shape, params)
{'params': {'embedding': (28, 2)}}
If we get some embeddings, we see that the model is doing exactly what we said it would do (it’s just grabbing the $i^\text{th}$ row of the matrix).
model.apply(params, str2ids('abc'))
Array([[ 0.35985 , -0.75417924],
[-1.206328 , 0.7793859 ],
[ 0.11096746, -1.0079818 ]], dtype=float32)
params['params']['embedding'][0:3]
Array([[ 0.35985 , -0.75417924],
[-1.206328 , 0.7793859 ],
[ 0.11096746, -1.0079818 ]], dtype=float32)
These weights get tuned, thereby “learning” a $d$-dimensional embedding for each token.
The positional encoding is the same size as a single observation fed to the model and added to each observation in the batch. We use the same function as they used in the original paper. Let $X\in\mathbb{R}^{s\times d}$ where $s$ is the max sequence length, and $d$ is the embedding dimension.
\[f(X_{i,j}) = \begin{cases} \sin\left(i/\left(10000^{j/d}\right)\right) & \text{if } j\equiv 0\pmod{2} \\ \cos\left(i/\left(10000^{(j-1)/d}\right)\right) & \text{if } j\equiv 1\pmod{2} \end{cases}\]

def sin_pos_enc(sequence_length, embed_dim):
"""create sin/cos positional encodings
Parameters
==========
sequence_length : int
The max length of the input sequences for this model
embed_dim : int
the embedding dimension
Returns
=======
a matrix of shape: (sequence_length, embed_dim)
"""
chex.assert_is_divisible(embed_dim, 2)
X = jnp.expand_dims(jnp.arange(sequence_length), 1) / \
jnp.power(10000, jnp.arange(embed_dim, step=2) / embed_dim)
out = jnp.empty((sequence_length, embed_dim))
out = out.at[:, 0::2].set(jnp.sin(X))
out = out.at[:, 1::2].set(jnp.cos(X))
return out
sin_pos_enc(5, 2)
Array([[ 0. , 1. ],
[ 0.841471 , 0.5403023 ],
[ 0.9092974 , -0.41614684],
[ 0.14112002, -0.9899925 ],
[-0.7568025 , -0.6536436 ]], dtype=float32)
We’ll come back to this later.
Transformers are built around the Multi-Head Attention you see in the picture, but MHA is itself built on attention. Attention is just a function that takes 3 matrix arguments (query, key, and value) and aggregates them into a single output matrix. There are a few forms of attention, but we’ll focus on the one used in the seminal paper: scaled dot product attention.
Let $Q\in\mathbb{R}^{n\times d},K\in\mathbb{R}^{m\times d},V\in\mathbb{R}^{m\times v}$ be the query, key, and value. Basically we just need the shapes to be fit for the matrix multiplication below. A good reference for this is d2l.ai.
\[\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V\in\mathbb{R}^{n\times v}\]

The $\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)$ part is called the attention weights.
It’s worthwhile to note that there are no learnable weights in this formula.
This formula is deceptive in 2 ways; we’ll unpack them below, starting with the softmax.
Let $X\in\mathbb{R}^k$ be a vector; then $\text{softmax}(X)\in\mathbb{R}^k$.
\[\text{softmax}(X)_i = \frac{e^{X_i}}{\sum_{j=0}^{k-1}e^{X_j}}\]

It’s just normalization with a monotonic function applied, meaning the relative ranking of the elements of $X$ isn’t changed. For more on this, see this post.
For masked softmax, we’ll be taking the approximate approach. Because of the sum in the denominator and the exponentiation, it’s unwise to mask with 0 ($e^0 = 1$). Instead we’ll mask with a very large negative number before we exponentiate so that the result is close to 0 ($e^{-\infty} \approx 0$).
def masked_softmax(args, mask):
if mask is not None:
args = args + (mask.astype(args.dtype) * -10_000.0)
return nn.softmax(args)
def dot_prod_attn(q, k, v, dropout=lambda x: x, mask=None):
# NxD @ DxM => NxM
# (B[, H], N, M)
attn_scores = q @ k.swapaxes(-2, -1) / jnp.sqrt(q.shape[-1])
attn_weights = masked_softmax(attn_scores, mask)
# (B[, H], N, D)
out = dropout(attn_weights) @ v
return out, attn_weights
# these are 13 batches of Q, K, V matrices arranged into rank 3 tensors
Q = jran.normal(jran.fold_in(key, 0), (13, 3, 7))
K = jran.normal(jran.fold_in(key, 1), (13, 5, 7))
V = jran.normal(jran.fold_in(key, 2), (13, 5, 11))
print(jax.tree_map(jnp.shape, dot_prod_attn(Q, K, V)))
del Q, K, V
((13, 3, 11), (13, 3, 5))
Multi-head attention involves stacking a collection of attention “heads” and adding some learned weights in the mix. As such, we’ll start with attention heads and progress to multi-head attention.
At a high level, multi-head attention is a bunch of stacked attention layers. But given that there are no learnable weights in the attention heads (the query, key, and value are all arguments), each would yield the same result – not so useful. So instead, we train a linear layer per attention head, and then concatenate the results.
Many implementations use one linear layer and reshape the output rather than storing a collection of linear models. At first this might not seem kosher, but it is. The picture below shows how 2 attention heads (red and blue) can be trained with one linear model.
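To convince ourselves numerically, here’s a quick throwaway check (not part of the model code) that one wide linear projection, reshaped into heads, gives exactly the same numbers as a separate projection per head:
d, n_heads, size_per_head = 4, 2, 3
x = jran.normal(jran.fold_in(key, 10), (5, d))                        # 5 tokens of dimension d
W = jran.normal(jran.fold_in(key, 11), (d, n_heads * size_per_head))  # one wide projection matrix
wide = (x @ W).reshape(5, n_heads, size_per_head)                     # project once, then split into heads
per_head = jnp.stack([x @ W[:, :size_per_head],                       # head 0's own slice of W
                      x @ W[:, size_per_head:]], axis=1)              # head 1's own slice of W
assert jnp.allclose(wide, per_head)
del d, n_heads, size_per_head, x, W, wide, per_head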
class MultiHeadAttention(nn.Module):
n_heads: int
size_per_head: int
attn_dropout: float
fc_dropout: float
attn_fn: Callable = dot_prod_attn
@nn.compact
def __call__(self, q, k, v, mask=None, *, training=False):
"expected shape: Batch, [N|M], Dim"
B, N, D = q.shape
_, M, _ = k.shape
def qkv_layer(x, name):
x = nn.Dense(self.n_heads * self.size_per_head, name=name)(x)
x = x.reshape((B, -1, self.n_heads, self.size_per_head)).swapaxes(1, 2)
return x
# BxNxD => BxHxNxP
q = qkv_layer(q, 'query_linear')
# BxMxD => BxHxMxP
k = qkv_layer(k, 'key_linear')
# BxMxD => BxHxMxP
v = qkv_layer(v, 'value_linear')
if mask is not None:
# accounting for reshape in qkv_layer
# B[xN]xN => Bx1[xN]xN
mask = jnp.expand_dims(mask, 1)
if mask.ndim < q.ndim:
# softmax is applied to dim -1
# Bx1xN => Bx1x1xN
mask = jnp.expand_dims(mask, -2)
attn_do = nn.Dropout(self.attn_dropout, deterministic=not training, name='attn_dropout')
out, attn_weights = self.attn_fn(q, k, v, attn_do, mask=mask)
# uncomment to keep attention weights in state
# self.sow('intermediates', 'weights', attn_weights)
out = out.swapaxes(1, 2).reshape((B, N, -1))
out = nn.Dense(D, name='output_linear')(out)
out = nn.Dropout(self.fc_dropout, deterministic=not training, name='fc_dropout')(out)
return out
As we all know at this point, these models can get quite big. It turns out transformers are just naturally large models. Below we show that even a pathologically simple MultiHeadAttention layer has 63 parameters!
batch_size = 2
sequence_length = 5
embed_dim = 3
n_heads = 2
size_per_head = 2
X = jnp.arange(batch_size * sequence_length * embed_dim)
X = X.reshape((batch_size, sequence_length, embed_dim))
mdl = MultiHeadAttention(n_heads, size_per_head, attn_dropout=0.2, fc_dropout=0.3)
params = mdl.init(key, X, X, X, mask=(jnp.max(X, axis=-1) < 0.8).astype(jnp.float32))
nn.tabulate(mdl, key, console_kwargs={'force_jupyter': True})(X, X, X)
del batch_size, sequence_length, embed_dim, n_heads, size_per_head, X, mdl
MultiHeadAttention Summary ┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ path ┃ module ┃ inputs ┃ outputs ┃ params ┃ ┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩ │ │ MultiHeadAttention │ - int32[2,5,3] │ float32[2,5,3] │ │ │ │ │ - int32[2,5,3] │ │ │ │ │ │ - int32[2,5,3] │ │ │ ├───────────────┼────────────────────┼──────────────────┼──────────────────┼──────────────────────┤ │ query_linear │ Dense │ int32[2,5,3] │ float32[2,5,4] │ bias: float32[4] │ │ │ │ │ │ kernel: float32[3,4] │ │ │ │ │ │ │ │ │ │ │ │ 16 (64 B) │ ├───────────────┼────────────────────┼──────────────────┼──────────────────┼──────────────────────┤ │ key_linear │ Dense │ int32[2,5,3] │ float32[2,5,4] │ bias: float32[4] │ │ │ │ │ │ kernel: float32[3,4] │ │ │ │ │ │ │ │ │ │ │ │ 16 (64 B) │ ├───────────────┼────────────────────┼──────────────────┼──────────────────┼──────────────────────┤ │ value_linear │ Dense │ int32[2,5,3] │ float32[2,5,4] │ bias: float32[4] │ │ │ │ │ │ kernel: float32[3,4] │ │ │ │ │ │ │ │ │ │ │ │ 16 (64 B) │ ├───────────────┼────────────────────┼──────────────────┼──────────────────┼──────────────────────┤ │ attn_dropout │ Dropout │ float32[2,2,5,5] │ float32[2,2,5,5] │ │ ├───────────────┼────────────────────┼──────────────────┼──────────────────┼──────────────────────┤ │ output_linear │ Dense │ float32[2,5,4] │ float32[2,5,3] │ bias: float32[3] │ │ │ │ │ │ kernel: float32[4,3] │ │ │ │ │ │ │ │ │ │ │ │ 15 (60 B) │ ├───────────────┼────────────────────┼──────────────────┼──────────────────┼──────────────────────┤ │ fc_dropout │ Dropout │ float32[2,5,3] │ float32[2,5,3] │ │ ├───────────────┼────────────────────┼──────────────────┼──────────────────┼──────────────────────┤ │ │ │ │ Total │ 63 (252 B) │ └───────────────┴────────────────────┴──────────────────┴──────────────────┴──────────────────────┘ Total Parameters: 63 (252 B)
We’re going to want to keep track of how many parameters we have as we go, and looking at a giant table is just not very efficient. To that end, let’s write a little function to do this:
def num_params(params):
param_sizes = jax.tree_map(lambda x: jnp.prod(jnp.array(jnp.shape(x))), params)
param_size_leafs, _ = jax.tree_util.tree_flatten(param_sizes)
return jnp.sum(jnp.array(param_size_leafs)).item()
print(f'{num_params(params) = }')
del params
num_params(params) = 63
The AddAndNorm and FeedForward layers are so simple that many implementations don’t implement them explicitly. We’ll implement them just so our code looks like the diagram.
class AddAndNorm(nn.Module):
"""The add and norm."""
@nn.compact
def __call__(self, X, X_out):
return nn.LayerNorm()(X + X_out)
class FeedForward(nn.Module):
"""a 2-layer feed-forward network."""
hidden_dim: int
@nn.compact
def __call__(self, X):
D = X.shape[-1]
X = nn.Dense(self.hidden_dim)(X)
X = nn.relu(X)
X = nn.Dense(D)(X)
return X
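As a quick (purely illustrative) shape check: FeedForward maps back to the input’s trailing dimension, so the residual addition inside AddAndNorm lines up.
ff = FeedForward(hidden_dim=4)
aan = AddAndNorm()
x = jnp.ones((2, 5, 3))
ff_params = ff.init(key, x)
aan_params = aan.init(key, x, x)
print(ff.apply(ff_params, x).shape)       # (2, 5, 3) – same trailing dim as the input
print(aan.apply(aan_params, x, x).shape)  # (2, 5, 3)
del ff, aan, x, ff_params, aan_params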
The encoder takes a sequence of tokens as input, and outputs a sequence of contextual embeddings. This means the embeddings for “light” in the sequence “light bulb” will be different from the one in “light weight”, a major improvement over non-contextual embeddings like word2vec.
The Encoder is a combination of the various layers we’ve already built up along with several EncoderLayers (which are themselves just combinations of previously defined layers). This section is going to be short.
Note the EncoderLayer takes one argument (neglecting the mask) and feeds that one argument as the query, key, and value in the Multi-Head Attention layer. This can be seen by following the arrows in the diagram.
class EncoderLayer(nn.Module):
hidden_dim: int
n_heads: int
size_per_head: int
attn_dropout: float
fc_dropout: float
def setup(self):
self.attn = MultiHeadAttention(n_heads=self.n_heads,
size_per_head=self.size_per_head,
attn_dropout=self.attn_dropout,
fc_dropout=self.fc_dropout)
self.aan_0 = AddAndNorm()
self.ff = FeedForward(hidden_dim=self.hidden_dim)
self.aan_1 = AddAndNorm()
def __call__(self, X, mask=None, *, training=False):
X1 = self.attn(X, X, X, mask=mask, training=training)
X = self.aan_0(X, X1)
X1 = self.ff(X)
X = self.aan_1(X, X1)
return X
class Encoder(nn.Module):
pos_encoding: Callable[[int, int], jnp.array]
vocab_size: int
embed_dim: int
layers: Sequence[EncoderLayer]
@nn.compact
def __call__(self, X, mask=None, *, training=False):
B, N = X.shape
if mask is not None:
chex.assert_shape(mask, (B, N))
X = nn.Embed(self.vocab_size, self.embed_dim, name='embed')(X)
X = X * jnp.sqrt(self.embed_dim)
# X.shape[-2] is the sequence length
X = X + self.pos_encoding(X.shape[-2], self.embed_dim)
for layer in self.layers:
X = layer(X, mask=mask, training=training)
return X
There are quite a few parameters here: even with still pathologically small hyperparameters, we get an astronomical $47,670$ parameters!
def layer_fn():
return EncoderLayer(hidden_dim=13,
attn_dropout=0.1,
fc_dropout=0.1,
n_heads=7,
size_per_head=17)
mdl = Encoder(pos_encoding=sin_pos_enc, vocab_size=len(vocab),
embed_dim=2 * 3 * 5,
layers=[layer_fn() for _ in range(3)])
batch = strs2ids('hey', 'there', 'ma', 'dood')
mask = (batch == vocab['<pad>'])
params = mdl.init(key, batch)
num_params(params['params'])
47670
The decoder generates new sequences given some input state sequence (maybe the output of the Encoder). You build up a sequence by iteratively asking the model for the next token until either some stop criterion is met or you get a token signifying the end of the sequence (we’re using "<pad>" for this). This iterative approach cannot be parallelized efficiently.
Transformers can train on full sequences without this recursion, but it requires a clever trick called causal masking. When computing gradients, it’s important that output token $i$ cannot attend to any later output token $i+k$, as those won’t be available in production.
def causal_mask(shape):
return jnp.triu(jnp.ones(shape, dtype=jnp.bool_), k=1)
causal_mask((1, 5, 5))
Array([[[False, True, True, True, True],
[False, False, True, True, True],
[False, False, False, True, True],
[False, False, False, False, True],
[False, False, False, False, False]]], dtype=bool)
One thing to note: decoder-only transformer layers remove the cross-attention layer (the middle attention).
class DecoderLayer(nn.Module):
hidden_dim: int
n_heads: int
size_per_head: int
attn_dropout: float
fc_dropout: float
@nn.compact
def __call__(self, X_enc, X_dec, enc_mask, dec_mask, *, training=False):
def attn(q, kv, mask, training, name):
mdl = MultiHeadAttention(n_heads=self.n_heads,
size_per_head=self.size_per_head,
attn_dropout=self.attn_dropout,
fc_dropout=self.fc_dropout,
name=f'{name}_attn')
out = mdl(q, kv, kv, mask=mask, training=training)
aan = AddAndNorm(name=f'{name}_addnorm')
return aan(q, out)
X_dec = attn(X_dec, X_dec, dec_mask, training, 'self')
X_dec = attn(X_dec, X_enc, enc_mask, training, 'src')
X1 = FeedForward(hidden_dim=self.hidden_dim)(X_dec)
X_dec = AddAndNorm()(X_dec, X1)
return X_dec
class Decoder(nn.Module):
pos_encoding: Callable[[int, int], jnp.array]
vocab_size: int
embed_dim: int
layers: Sequence[DecoderLayer]
@nn.compact
def __call__(self, X_enc, X_dec, enc_mask, *, training=False):
B, N = X_dec.shape[:2]
dec_mask = causal_mask((1, N, N))
X_dec = nn.Embed(self.vocab_size, self.embed_dim, name='embed')(X_dec)
X_dec = X_dec * jnp.sqrt(self.embed_dim)
# X.shape[-2] is the sequence length
X_dec = X_dec + self.pos_encoding(X_dec.shape[-2], self.embed_dim)
for layer in self.layers:
X_dec = layer(X_enc, X_dec, enc_mask, dec_mask, training=training)
X_dec = nn.Dense(self.vocab_size, name='final')(X_dec)
return X_dec
Checking the size of these models using the same hyperparameters as we did with the Encoder… 91 thousand parameters!
def layer_fn():
return DecoderLayer(hidden_dim=13,
attn_dropout=0.1,
fc_dropout=0.1,
n_heads=7,
size_per_head=17)
mdl = Decoder(pos_encoding=sin_pos_enc,
vocab_size=len(vocab),
embed_dim=2 * 3 * 5,
layers=[layer_fn() for _ in range(3)])
batch = strs2ids('hey', 'there', 'ma', 'dood')
kv = strs2ids('i', 'really', 'enjoy', 'algorithms')
enc_mask = (kv == vocab['<pad>'])
kv = nn.one_hot(kv, len(vocab))
params = mdl.init(key, kv, batch, enc_mask)
print(f'{num_params(params) = }')
del layer_fn, mdl, batch, kv, enc_mask, params
num_params(params) = 91291
Transformers come in three main flavors: encoder/decoder (the full architecture in the diagram), encoder-only, and decoder-only. Decoder-only transformers remove the middle multi-head attention (the cross-attention) layer, as there is nothing to cross with. Here we’ll assemble the full encoder/decoder flavor, and as we’ll see in a minute, it can be used to compute rot13 encryption!
class EncoderDecoderTransformer(nn.Module):
pos_encoding: Callable[[int, int], jnp.array]
in_vocab_size: int
out_vocab_size: int
embed_dim: int
n_layers: int
hidden_dim: int
attn_dropout: float
fc_dropout: float
n_heads: int
size_per_head: int
def setup(self):
self.encoder = Encoder(
pos_encoding=self.pos_encoding,
vocab_size=self.in_vocab_size,
embed_dim=self.embed_dim,
layers=[EncoderLayer(hidden_dim=self.hidden_dim,
attn_dropout=self.attn_dropout,
fc_dropout=self.fc_dropout,
n_heads=self.n_heads,
size_per_head=self.size_per_head,
name=f'encoder_{i}')
for i in range(self.n_layers)])
self.decoder = Decoder(
pos_encoding=self.pos_encoding,
vocab_size=self.out_vocab_size,
embed_dim=self.embed_dim,
layers=[DecoderLayer(hidden_dim=self.hidden_dim,
attn_dropout=self.attn_dropout,
fc_dropout=self.fc_dropout,
n_heads=self.n_heads,
size_per_head=self.size_per_head,
name=f'decoder_{i}')
for i in range(self.n_layers)])
def __call__(self, X, Y, source_mask, *, training=False):
# required for dot product attention
chex.assert_equal(self.encoder.embed_dim, self.decoder.embed_dim)
encodings = self.encoder(X, source_mask, training=training)
self.sow('intermediates', 'encodings', encodings)
return self.decoder(encodings, Y, source_mask, training=training)
A tiny EncoderDecoder model has nearly 32 thousand parameters. And we’re not even trying yet.
mdl = EncoderDecoderTransformer(
pos_encoding=sin_pos_enc,
in_vocab_size=len(vocab),
out_vocab_size=len(vocab),
embed_dim=2 * 3 * 5,
n_layers=3,
hidden_dim=13,
attn_dropout=0.1,
fc_dropout=0.1,
n_heads=7,
size_per_head=3
)
X = strs2ids('hey', 'there', 'ma', 'dood')
y = strs2ids('i', 'really', 'enjoy', 'algorithms')
mask = (X == vocab['<pad>'])
params = mdl.init(key, X, y, mask)
print(f'{num_params(params) = }')
del mdl, X, y, mask, params
num_params(params) = 31903
We’re going to train our transformer to encrypt words via rot13. Rot13 is an old-school encryption algorithm where each character is shifted by 13 characters (see below).
Training a transformer to do rot13 is a bit like using a chainsaw to give an injection, but it’s simple, so it’s well suited to our purpose.
Since there are 26 letters in the English alphabet, rot13 is its own inverse! That means if you encode a message with rot13 twice, you get back the original message.
def rot13(input_string):
return ''.join([chr(((vocab[x] + 13) % 26) + 97) for x in input_string])
a = 'asdfqwerz'
print(a, '=>', rot13(a), '=>', rot13(rot13(a)))
del a
asdfqwerz => nfqsdjrem => asdfqwerz
Let’s write our data generator.
def get_data(key):
k0, k1 = jran.split(key, 2)
max_len = 15
X = jran.randint(k0, (50, max_len), 0, len(vocab) - 2)
mask = jnp.stack([jnp.arange(max_len) >= i for i in jran.randint(k1, (50,), 1, max_len)])
X = X * (1 - mask) + (mask * vocab['<pad>'])
Y = ((X + 13) % (len(vocab) - 2)) # cheap version of rot13 at the encoded level
Y = (1 - mask) * Y + mask * vocab['<pad>']
Ys = (
jnp.ones_like(Y, dtype=jnp.int32)
.at[:, 1:].set(Y[:, :-1])
.at[:, 0].set(vocab['<start>'])
)
return (X, Ys, mask.astype(jnp.float32)), Y
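A quick look at what the generator returns (just the shapes; the values depend on the key): a batch of 50 source sequences X, the decoder inputs Ys (Y shifted right with a leading <start> token), the padding mask, and the target Y.
(X_demo, Ys_demo, mask_demo), Y_demo = get_data(jran.PRNGKey(1))
print(X_demo.shape, Ys_demo.shape, mask_demo.shape, Y_demo.shape)
# (50, 15) (50, 15) (50, 15) (50, 15)
del X_demo, Ys_demo, mask_demo, Y_demo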
mdl = EncoderDecoderTransformer(pos_encoding=sin_pos_enc,
in_vocab_size=len(vocab),
out_vocab_size=len(vocab),
embed_dim=8,
n_layers=1,
hidden_dim=5,
attn_dropout=0.0,
fc_dropout=0.0,
n_heads=7,
size_per_head=5)
opt = optax.chain(
optax.clip_by_global_norm(1),
optax.sgd(
learning_rate=optax.warmup_exponential_decay_schedule(
init_value=0.5, peak_value=0.8, warmup_steps=100,
transition_steps=200, decay_rate=0.5,
transition_begin=100, staircase=False, end_value=1e-3
)
)
)
params = mdl.init(key, *get_data(key)[0])
print('num_params: ', num_params(params))
opt_state = opt.init(params)
num_params: 4665
One nice thing about jax is that you don’t compile a model, you compile the whole training loop (which, in our case includes data generation).
@jax.jit
def train_step(params, opt_state, step, key):
"""Train for a single step."""
k0, k1 = jran.split(jran.fold_in(key, step))
args, y = get_data(k0)
@jax.grad
def grad_fn(params):
logits = mdl.apply(params, *args,
training=True, rngs={'dropout': k1})
loss = optax.softmax_cross_entropy_with_integer_labels(
logits, y
).mean()
return loss
grads = grad_fn(params)
updates, opt_state = opt.update(
grads, opt_state, params)
params = optax.apply_updates(params, updates)
return params, opt_state
We’ll run 10,000 training steps (which takes about 3 minutes on my laptop)…
for step in tqdm(range(10_000)):
    params, opt_state = train_step(params, opt_state, step, key)
Now let’s run the test.
X = strs2ids('hey', 'there', 'ma', 'dood')
start = jnp.array([[vocab['<start>']]] * X.shape[0], dtype=jnp.int32)
Y = start
while (Y[:, -1] != vocab['<pad>']).any():
Y = jnp.argmax(mdl.apply(params, X, jnp.concatenate([start, Y], axis=-1), X == vocab['<pad>']), axis=-1)
ids2strs(list(Y))
['url', 'gurer', 'zn', 'qbbq']
[rot13(x) for x in ids2strs(list(Y))]
['hey', 'there', 'ma', 'dood']
And that’s all folks! You can now transform with the best of ‘em!
This fact is very quickly becoming outdated. ↩
In the last post we covered the definition of places and valuations (sometimes called valuation rings). We covered how there is a one to one correspondence between valuations and places (up to equivalences). That was the entirety of chapter 1 section 1. This post will be about chapter 1 section 2.
Let $\Gamma$ be a multiplicative commutative group. If there exists a multiplicatively closed subset $S\subset \Gamma$ that does not contain the group’s $1$ element and has the property that for each $x$ in $\Gamma$ either $x\in S$ or $x^{-1}\in S$1, we can define a strict total order on $\Gamma$ as:
\[a \lt b \iff ab^{-1}\in S\]

Just to refresh everyone, a total order is a relation with 4 properties,
Let’s prove that this is, indeed, a strict total order.
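As a sketch of one of these (the transitivity argument, which is referred to again just below): suppose $a \lt b$ and $b \lt c$, i.e. $ab^{-1}\in S$ and $bc^{-1}\in S$. Since $S$ is multiplicatively closed and $\Gamma$ is commutative,

\[(ab^{-1})(bc^{-1}) = ac^{-1}\in S,\]

so $a \lt c$.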
There is, in fact, another property that Lang mentions alongside some of the order properties2:
This is essentially the same proof used for transitivity (above), so I’ll omit it.
We now add a 0 element to the mix, such that for all $a\in \Gamma$, $0a = 0$, and $0 \lt a$.
We now (re?)define “valuations”. I guess they’re making a distinction between valuations and valuation rings (or maybe Lang is just adding another equivalent perspective). Either way, in this section we define it seemingly for the second time.
Let $K$ be a field, $\Gamma$ an ordered commutative multiplicative group, a valuation on $K$ is a map $v: K\to \Gamma$ (sometimes abbreviated as $v(a) = |a|$) with the following properties:
The following are noted additional properties (not part of the definition, but provably true):
$|\pm a| = |a|$ “proof is obvious”, but we don’t play that…
$1 = |1| = |(-1)^2| = |-1|^2$. Since $-1\notin\Gamma$, $|-1| = 1$.
$|-a| = |-1\cdot a| = |-1|\cdot |a| = |a|$
If $|a| \lt |b|$, then $|a + b| = |b|$ (see book for proof). What I will include here is that the proof rests on the following (not entirely intuitive) result4:
If $|a| \lt |b|$, then $|a| \le |a + b |$.
But
\[\begin{align*} |a| =& |a + b - b| \\ \le& \max(|a + b |, |b|) \\ \le& \max(\max(|a|, |b|), |b|) \\ =& |b| \end{align*}\]

In any sum: $a_1 + a_2 + \dots + a_n = 0$, at least two elements of the sum (summands?) have the same value. We can first prove that for any sum where each summand has distinct valuations, the valuation of the sum is the maximum of the valuations of the terms. We know this is true for $n=2$, so let’s do the inductive thing and show $n \implies n+1$. Well,
\[\begin{align*} |a_1 + \dots a_n + a_{n+1}| =& |(a_1 + \dots a_n) + a_{n+1}| \\ =& \max(|a_1 + \dots a_n|, |a_{n+1}|) \\ =& \max(\max(|a_1|, \dots, |a_n|), |a_{n+1}|) \\ =& \max(|a_1|, \dots, |a_n|, |a_{n+1}|) \end{align*}\]

So if our given sum equals 0, either the max valuation is 0 (then all terms have valuation = 0), or the premise that all terms have distinct valuations is false.
The requirement used in the textbook is that $\Gamma = S \cup 1 \cup S^{-1}$ (disjoint). The phrasing I used has the benefit of relating directly to the previous section, and is equivalent, so I used it. The equivalence follows because $S$ being multiplicatively closed with $1\notin S$ means $x$ and $x^{-1}$ cannot both be in $S$. ↩
It’s strange that Lang includes this property alongside two of the other ordering properties, but I’m sure he knows what he’s doing, so I’ll include it and see where it goes. ↩
This is not true in most orderings over fields. For instance, take $\mathbb{Q}$ or $\mathbb{R}$, and let $\gamma = -1$ or $\gamma = 0$. It is true if you restrict $\gamma \gt 0$, though. ↩
This is trivially true for norms, obviously, but we’re using that notation for valuations, which aren’t entirely the same (in fact, rule 3 contradicts the triangle inequality of norms). ↩
I haven’t really studied any Math seriously since I graduated with my undergrad in 2015. It seems I kinda miss it, so I’ll be going over Introduction to Algebraic Geometry by Serge Lang today.
It starts out with the theory of Places. The idea being that homomorphisms between fields are necessarily injective. Before we prove that, we define a field homomorphism as a function $\varphi:F\to K$ such that:

1. $\varphi(a + b) = \varphi(a) + \varphi(b)$
2. $\varphi(a \cdot b) = \varphi(a) \cdot \varphi(b)$
3. $\varphi(1) = 1$

Axiom 3 only guarantees that the homomorphism doesn’t send every element to 0.
It’s been a long time since I’ve done any math, so let’s warm up a bit with a direct, computational proof.
Proof:
First we prove $\forall x \neq 0, \varphi(x) \neq 0$. To that end, let’s assume toward a contradiction that there exists an $x \neq 0$ such that $\varphi(x) = 0$. Then…
\[\begin{align*} 0 =& \varphi(x) \\\\ 0 \cdot \varphi(x^{-1}) =& \varphi(x) \cdot \varphi(x^{-1}) \\\\ 0 =& \varphi(x \cdot x^{-1}) \\\\ 0 =& \varphi(1) \\\\ 0 =& 1 \end{align*}\]

Next, inverses necessarily get sent to inverses, although you may have noticed that we’ve already proven it above.
\[\begin{align*} 1 =& \varphi(1) \\\\ =& \varphi(x \cdot x^{-1}) \\\\ =& \varphi(x) \cdot \varphi(x^{-1}) \end{align*}\]

Okay, now let’s prove it’s injective:
\[\begin{align*} \varphi(x_0) =& \varphi(x_1) \\\\ \varphi(x_0) \cdot \varphi(x_1)^{-1} =& \varphi(x_1) \cdot \varphi(x_1)^{-1} \\\\ \varphi(x_0) \cdot \varphi(x_1^{-1}) =& \varphi(x_1) \cdot \varphi(x_1^{-1}) \\\\ \varphi(x_0 \cdot x_1^{-1}) =& \varphi(x_1 \cdot x_1^{-1}) \\\\ 1 - \varphi(x_0 \cdot x_1^{-1}) =& 1 - \varphi(1) \\\\ \varphi(1 - x_0 \cdot x_1^{-1}) =& 0 \\\\ 1 - x_0 \cdot x_1^{-1} =& 0 \\\\ 1 =& x_0 \cdot x_1^{-1} \\\\ x_1 =& x_0 \end{align*}\]

\[\blacksquare\]

Okay, so now that our mathematical juices are starting to flow, let’s try a more mature mathematical approach… The proof is built upon the theory of rings and ideals. So, let’s define them.
A ring is a set $R$ along with a commutative operation $+$, a $0$ element, and additive inverses for every element. There is also a $\cdot$ operation, under which inverses are not necessarily defined.
An ideal $I \lt R$ is a subset of $R$ such that:

1. $a, b \in I \implies a + b \in I$
2. $a \in I$ and $r \in R \implies ra \in I$
Proof:
The kernel of a ring homomorphism1 is an ideal2, and the only ideals of a field are $\{0\}$ and $F$ itself. Since $\varphi(1) = 1$, we know the kernel is not $F$, and hence $\varphi(x) = 0 \iff x = 0$.
Yay! We’re real mathematicians now! Let’s be happy and all drink.
Mathematicians are really running out of names for things… But places are the topic of chapter one, so let’s do it…
Let $F$ be a field. We can extend $F$ to include the notion of $\infty$ by adding it to the field with the following definitions for the different operations. Let $a\in F_{\neq 0}$.
\[\begin{align*} a \pm \infty =& \infty \\\\ a \cdot \infty =& \infty \\\\ 1 / \infty =& 0 \\\\ 1 / 0 =& \infty \\\\ \infty \cdot \infty =&\infty \end{align*}\]

Note: $\infty \pm \infty$, $0\cdot \infty$, $\infty / \infty$, $0 / 0$ are undefined.
Let $K,F$ be fields; then an $F$-valued place is a mapping $\varphi:K\to F\cup\{\infty\}$ that satisfies the field homomorphism axioms (whenever defined).
The set, $\nu \subset K$, of elements that don’t get mapped to $\infty$ (called the finite elements of $\varphi$) forms a ring. This can be seen because $+$ and $\cdot$ are extended to include $\infty$, but not overridden for finite elements.
The kernel is an ideal of $\nu$, and in fact, $\nu/\text{ker} \cong \varphi(\nu)$ is a field. This can be seen because if $\varphi(x)$ is finite, then $\varphi(x^{-1})$ is finite too (by axiom 2 of homomorphisms). It is also a consequence of the kernel being a maximal ideal of $\nu$3.
A valuation ring of $K$ is a ring $R\subseteq K$ such that $\forall a \in K, a\notin R \implies a^{-1}\in R$. $\nu$ is a valuation ring of $K$. It turns out, there is a 1-to-1 correspondence between valuation rings and places. To prove this, we only need to show that the non-units of $R$ form a maximal ideal, then we’ll be able to construct a map between $K$ and $(K/\text{ker}(\varphi))\cup\{\infty\}$ that makes this a place.
Let $a, b\in R$ be non-units, and let $\lambda \in R$ be arbitrary.
So valuation rings and places are somewhat interchangeable (at least up to equivalence of places4). And now we’re caught up to half-way down page 3.
Ring homomorphisms follow the same axioms as field homomorphisms. ↩
if $\varphi(a) = 0$ and $\varphi(b) = 0$, then axiom 1 says, $\varphi(a + b) = 0$, so axiom 1 of ideals is satisfied. Similar reasoning gives you axiom 2 of ideals. ↩
A ring modded by a maximal ideal is a field. ↩
Two places $\varphi_0:K\to F_0\cup\{\infty\}$ and $\varphi_1:K\to F_1\cup\{\infty\}$ are equivalent if there exists an isomorphism $\lambda:F_0 \to F_1$ such that $\varphi_1 = \lambda\varphi_0$. We define $\varphi(\infty) = \infty$. ↩
You’re standing on the surface of the earth (as opposed to whatever your usual hangouts are), and walk a mile south, a mile east, and then a mile north and end up where you started. You see a bear. What color is the bear?
You’ve heard this one before, but just in case, the answer is: White. You are at the north pole and it was a polar bear. But that’s not the only point that it could be. The riddle is, can you describe the full set of solutions?
Pick any circle of latitude whose circumference is $1/n$ miles (for some positive integer $n$), then go a mile north of that and you have a solution. In fact, this set of solutions is dense around the ring with circumference of 1 mile.
Suppose $N$ people, numbered $1$ through $N$, are standing in a circle. Then starting with $1$ and going around the circle in turn, each person shoots the person directly to their left. This stops when there’s only one person left alive. For example: if $N=5$, it would go:

- 1 shoots 2
- 3 shoots 4
- 5 shoots 1
- 3 shoots 5
and only 3 remains.
Given $N$, what is the index (1-indexed, unfortunately) of the last shooter standing? For example: if $N=5$, the answer is $3$.
Every number $N$ can be written uniquely as $2^i + \lambda$, where $i,\lambda\in\mathbb{N}_{\geq 0}$ and $\lambda \lt 2^i$ (hence $\lambda$ is minimal and $i$ maximal).
The answer to the riddle is $\equiv 1 + 2\lambda \pmod{N}$ for this particular $\lambda$.
So full disclosure, I kinda cheated on this one by writing a program to “prove” my thesis before I did so mathematically. We’ll get to that in a bit; but first let’s go through the motions and write out this solution.
First, we can immediately tell that every time there is an even number of people in the circle, that round will end up at the same person it started with (as in, if there are 6 people in the circle, then 2, 4, and 6 are killed in the first round and we start round two on 1 again).
From there we can conclude that any power of two should end on 1. e.g.
if $N = 2^1 = 2$ we have:
if $N = 2^2 = 4$
If $N=8$,
and we’re left in pretty much the same scenario as $N=4$. In fact, for $2^{i+1}$, the first iteration will wipe out half of the people and leave us starting on 1 again – thereby leaving us in an analogous scenario as $2^i$.
In the case that $\lambda \gt 0$, we know that once $\lambda$ people have been shot, we’ll have $2^i$ people left. So whoever is the first shooter after $\lambda$ people have been shot will be the winner. The task now is to find out who is shooter number $\lambda + 1$. Since each shot progresses to the next odd number shooter (1 shoots 2 and 3 is next, etc.), we know it should be:
\(2(\lambda + 1)-1 = 1 + 2\lambda\)
import numpy as np
from time import time

def rotate(lst, n=1):
"rotate a list n times sending the first elems to the back"
return lst[n:] + lst[:n]
def shoot(lst):
lst = rotate(lst)
lst.pop(0)
return lst
def play(N: int) -> int:
players = list(range(1, N+1))
for _ in range(N - 1): # n-1 shots leaves one standing
players = shoot(players)
assert len(players) == 1, "Wrong number of players left"
return players[0]
def get_lambda(N: int) -> int:
i = int(np.floor(np.log2(N)))
return N - (2 ** i)
def check_theory(i):
return play(i) == (2 * get_lambda(i) + 1)
t0 = time()
for i in range(1, 1000):
assert check_theory(i)
print(time() - t0)
And we can confirm that the relationship holds, at least for the first 1000 positive integers. To do more, we’ll have to account for python’s speed issues.
from numba import jit
@jit(nopython=True)
def _rotate(lst, n=1):
"rotate a list n times sending the first elems to the back"
return lst[n:] + lst[:n]
@jit(nopython=True)
def _shoot(lst):
lst = _rotate(lst)
lst.pop(0)
return lst
@jit(nopython=True)
def _play(N: int) -> int:
players = list(range(1, N+1))
for _ in range(N - 1): # n-1 shots leaves one standing
players = _shoot(players)
assert len(players) == 1, "Wrong number of players left"
return players[0]
@jit(nopython=True)
def _get_lambda(N: int) -> int:
i = int(np.floor(np.log2(N)))
return N - (2 ** i)
@jit(nopython=True)
def _check_theory(i):
return _play(i) == (2 * _get_lambda(i) + 1)
t0 = time()
for i in range(1, 1000):
assert _check_theory(i)
print(time() - t0)
And now we run our horse race!
def race(func, max_n, num_races):
for _ in range(num_races):
t0 = time()
for i in range(1, max_n + 1):
assert func(i)
print(time() - t0)
print('no numba')
race(check_theory, 1000, 5)
On my machine, that prints something on the order of 1 second for each of the five runs. Now let’s try numba!
print('yes numba')
race(_check_theory, 1000, 5)
This one takes about 1.5 seconds for the first one, then consistently half a second for the rest.
The point being, if you’re going to run a function a bunch of times, it’s probably worthwhile letting numba have a go at it – it’s super cheap to do, after all.
There are $N$ doors indexed by $1,..,N$ along a wall and they’re all closed. Then for each $i\in\{1,2,\dots,N\}$, you toggle every $i$th door. So at first ($i = 1$) you toggle every door. Then for $i=2$ you toggle every second door – the even doors ($2, 4, 6, \dots$) – and so on.
For example, if $N=4$, then the following is the sequence of events:

- $i=1$: toggle doors 1, 2, 3, 4 (all four are now open)
- $i=2$: toggle doors 2, 4 (2 and 4 are now closed)
- $i=3$: toggle door 3 (3 is now closed)
- $i=4$: toggle door 4 (4 is now open)
This results in doors 1 and 4 being open, while 2 and 3 are closed.
Given $N$, describe the “door is closed” function. By that I mean, which doors are open and which are closed at the end?
Only the perfect squares are open at the end. So $1,4,9,16,\dots$ are open, and $2,3,5,6,7,8,\dots$ are closed.
NOTE: Spoilers below!
So there are really two philosophies to solving this riddle, and they each have their merits.
The first philosophy I’m talking about is what I’ll call the theory based approach. For this approach we’ll investigate what’s driving the system. By that we mean, given a number $m\leq N$, how many times will that $m$ be toggled? Note that we’re not investigating the task directly, but rather something a lot deeper and more specific.
Well, a number $m$ is toggled exactly once per divisor. For instance, the number $6$ is toggled 4 times – once at $1,2,3,6$ – and hence is closed at the end. $9$, on the other hand only gets toggled $3$ times – $1,3,9$ – and hence is open at the end.
So how many divisors does a given number have? For that, we’ll use some beautiful number theory (the number theory we’re using isn’t deep or anything, but any number theory is beautiful really). So any counting number $m$ can be written as a product of unique primes to non-negative integer exponents. So, mathematically, what I’m saying is:
\[\begin{align*} m =& \prod p_i^{\alpha_i} \end{align*}\]

Since every number can be written that way, all $m$’s divisors can be written that way too. By definition, any divisor of $m$ must be $\prod p_i^{\lambda_i}$ such that for each $i$, $0 \leq \lambda_i \leq \alpha_i$ – and each such collection of $\lambda_i$ determines a unique divisor. So how many such numbers can we create? Clearly: $\prod\limits_i(\alpha_i + 1)$ because we can have up to $\alpha_i$ powers of $p_i$, but we could also have 0 of them. So there are $\alpha_i + 1$ many.
So now we can ask which numbers have an odd number of divisors? Those will be the ones left open and the others will be closed. Well, an odd number of divisors means that $\gamma = \prod\limits_i(\alpha_i + 1)$ is odd. But since $\gamma$ is odd, none of its divisors can be even, and hence $\alpha_i + 1$ must be odd for each $i$. But that means $\alpha_i$ must be even for each $i$. And that’s the exact description of a perfect square.
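If you’d rather just check this claim computationally, here’s a small brute-force sketch (for an arbitrary $N=50$) confirming that the doors left open are exactly the perfect squares:

N = 50
doors = [False] * (N + 1)          # doors[1..N]; False means closed
for i in range(1, N + 1):
    for j in range(i, N + 1, i):   # toggle every i-th door
        doors[j] = not doors[j]
open_doors = [d for d in range(1, N + 1) if doors[d]]
assert open_doors == [k * k for k in range(1, 8)]   # 1, 4, 9, 16, 25, 36, 49
print(open_doors)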
Another way to use this understanding is that for every divisor there’s a pair, a buddy. But only if it’s a perfect square will there be a divisor whose buddy is itself, thereby making an odd number of divisors. But this is still based on our theoretical understanding of the scenario.
The second approach is one that’s particularly useful if the theory is very costly to investigate (for whatever reason). For this approach we’ll just collect a bunch of data, and see what fits. For this example, we’ll collect just the data that’s directly what we’re looking to question. Let’s do up to 11 (just to pick a random number) letting 0 denote closed and 1 open.
iteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | NOTES |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | |
3 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | |
4 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | |
5 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | transcription error |
6 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | |
7 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | |
8 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | |
9 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | |
10 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | |
11 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | |
12 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
At this point we might already have noticed a pattern. The first is 1, then $1 + 3 = 4$. Then $4 + 2\cdot 3 = 10$. So the next one should be $10 + 3 \cdot 3 = 19$, which we know from the first approach is wrong, so we collect more data and reject this hypothesis. We can imagine that with enough data we’d start to notice that 9 and 10 are outliers and even probably notice the transcription error at iteration 5 and end up with something like this.
index | closed? |
---|---|
1 | 0 |
2 | 1 |
3 | 1 |
4 | 0 |
5 | 1 |
6 | 1 |
7 | 1 |
8 | 1 |
9 | 0 |
10 | 1 |
11 | 1 |
12 | 1 |
13 | 1 |
14 | 1 |
15 | 1 |
16 | 0 |
And we conclude that the pattern is 1, then skip 3 (to 4), then skip 5 (to 9), then skip 7 (to 16), etc. and confirm it with some out-of-sample:
index | closed? |
---|---|
17 | 1 |
18 | 1 |
19 | 1 |
20 | 1 |
21 | 1 |
22 | 1 |
23 | 1 |
24 | 1 |
25 | 0 |
26 | 1 |
27 | 1 |
28 | 1 |
29 | 1 |
30 | 1 |
31 | 1 |
32 | 1 |
33 | 1 |
34 | 1 |
35 | 1 |
36 | 0 |
37 | 1 |
and I’m satisfied. But that pattern isn’t very illuminating, and it’s only valid in the statistical sense. But either way, we get an answer1.
What if we work in a field where data accrual is very slow (think Finance, Economics, or any such field). Clearly this approach has some pretty deep flaws for us. Our out-of-sample grows too slowly to be able to make conclusions like we made at the end there, and we’re stuck in the first realm where all data has errors and there are a plethora of theories that look great (in sample). And even if we hold out some data as “out-of-sample” and try to do everything right, we can still over-fit to that sample.
Luckily for us, there’s a lot of autocorrelation in these things. So if you’ve done everything properly and found something that works on the hold-out, it shouldn’t be too surprising if it works well for at least a short period on true out-of-sample, even if the thing is over-fit, but that’s not particularly satisfying.
So how else should we go about this? The economist’s approach, perhaps? It seems sensible at first glance; you get to build off of the tremendous work of the mathematical community and potentially get conclusive answers. We can certainly model things with increasing complexity and beauty (unrelated), but at the end of the day the only way to assess the validity of these models is through out-of-sample predictions. You know, the way Physicists do it. But we can’t really do that here, so we do like we said we would and have a hold out, etc. and we’re already back to the data-science approach but with more story-telling. And naturally, since we want to pretend like we’re doing something more rigorous than data-science, we judge this largely on the sensibility of the stories. But all that means is that we’re unwilling to accept theories that are counter-intuitive – sorry Einstein, your theory of relativity isn’t wanted here.
So either we do the data science approach, or we do the “sensibility” based approach, which is itself based on itself. Why do people prefer the sensibility based approach? If you’re reading this, please really do email me your thoughts on this – I’d love to hear other opinions here.
We can show that these two solutions are equivalent by induction. Clearly it’s true for 1 (1 is a perfect square and is how we defined the start of the sequence). Now imagine a square of size $n$ and let’s see how many we have to add to get to a square with side-length $n+1$. Well, we have to add $n$ to each side, then 1 for the corner. So we add $2n + 1$. So for instance, if $n=1$ (the base case) then the next perfect square would be $n^2 + (2n+1) = 1 + 3 = 4$. Indeed! ↩
If you’ve ever read any papers referring to quantile regression (QR), you undoubtedly have read “MSE yields means, MAE yields medians, and QR yields quantiles”, but what does that even mean? Like not just hand-wavy nonsense, but really, what does that mean? After all, these are just loss functions and therefore don’t yield anything at all. Even the regressions yields coefficients, or, more generally, models; so what’s with the mean and median talk? It turns out they’re talking about the distribution of residuals1. In this post, we’ll show what they mean, how that’s the case, and we’ll investigate a bit more about QR.
Given a scalar $q\in[0,1]$, let $l_q:\mathbb{R}\to\mathbb{R}_+$ be:
\[\begin{align*} l_q(r) =& r\cdot\left(q - \mathbb{I}_{r < 0}\right) \\ =& \begin{cases} qr & r > 0 \\ (q-1)(r) & r \leq 0\end{cases} \\ =& \begin{cases} qr & r > 0 \\ (1-q)(-r) & r \leq 0\end{cases} \\ =& \begin{cases} q\|r\| & r > 0 \\ (1 - q)\|r\| & r \leq 0\end{cases} \end{align*}\]

We define the Quantile Loss with quantile $q$ as the function $L_q:\mathbb{R^n}\to\mathbb{R}$ from the residuals to the loss as:
\[\begin{align*} L_q(r) =& \sum\limits_{i=0}^{n-1}l_q(r_i) \end{align*}\]

Basically, the loss is very much like MAE (A.K.A. $L_1$ loss), but we weight the absolute errors based on whether they are above or below zero. We can see by inspection that $L_{0.5}$ is just MAE scaled by a constant (which doesn’t change the minimizer). This is looking good, because if $L_q$ yields quantiles, MAE yields medians, and the $0.5$ quantile is the median, then these two should agree.
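To make the definition concrete, here’s a tiny sketch of $l_q$ in code (assuming the usual residual convention $r = y - \hat y$, which the formulas above implicitly use), along with a check that $q=0.5$ reduces to half the absolute loss:

import numpy as np

def quantile_loss(r, q):
    "pinball loss l_q summed over a vector of residuals r = y - y_hat"
    r = np.asarray(r, dtype=float)
    return np.sum(r * (q - (r < 0)))

r = np.array([1.5, -2.0, 0.25, -0.75])
assert np.isclose(quantile_loss(r, 0.5), 0.5 * np.abs(r).sum())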
A natural question to ask at this point is, when are you getting to your point? Er, I mean, what does this have to do with quantiles?
To answer this, let’s investigate when the derivative of $L_q$ will equal zero (a necessary condition for a minimum). To make our lives easier, let $\omega_0$ be the number of residuals that are greater than zero, and $\omega_1$ be the number of those that are less than or equal to zero.
\[\begin{align*} 0 =& \frac{\partial}{\partial r}L_q(r) \\\\ =& \frac{\partial}{\partial r}\left(\sum_{r\gt 0}qr + \sum_{r \le 0}(1-q)(-r)\right) \\\\ =& \sum_{r\gt 0}q + \sum_{r \le 0}(q - 1) \\\\ =& q\omega_0 + (q - 1)\omega_1 \\\\ =& -\omega_1 + q(\omega_0 + \omega_1) \iff \\\\ \omega_1 =& q(\omega_0 + \omega_1) \iff \\\\ q =& \frac{\omega_1}{\omega_0 + \omega_1} \end{align*}\]

So we can see that the Quantile Regression optimization problem attempts to over-estimate $q$ percent of the observations. In particular, if $q = 0.5$, then the QR loss attempts to ensure that there are just as many over-estimations as under-estimations.
But if $q=0.9$, for instance, then this loss will attempt to ensure your model over-estimates about 90% of the observations and under-estimates about 10% of them.
The Mean Squared Error loss (MSE) is defined as the function $L:\mathbb{R}^n\to\mathbb{R}_+$ such that:
\[\begin{align*} L(r) =& \frac{1}{2n}\sum\limits_{i=0}^{n-1}r_i^2 \\ \end{align*}\]

We include the 2 in the denominator to make the next part easy. But since this is a constant across all observations, it wouldn’t affect the results if we’d left it out.
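Mirroring the derivation we did for $L_q$ (differentiating with respect to a common shift of the residuals and setting it to zero):

\[\begin{align*} 0 =& \frac{\partial}{\partial r}\left(\frac{1}{2n}\sum\limits_{i=0}^{n-1}r_i^2\right) \\ =& \frac{1}{n}\sum\limits_{i=0}^{n-1}r_i \end{align*}\]

i.e., at the optimum the residuals average to zero.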
So we see that much like how QR yields quantiles, MSE yields means.
In practice, many things can affect how close to the actual quantile you get. For example, sampling, the number of iterations you allow the optimizer to take, etc. can all affect the accuracy of the optimization. Also, it’s useful to note that these optimizations are all in-sample, so any results should only be trusted insofar as how similarly distributed your live data will be to the training data.
Quantile regression attempts to distribute the residuals such that your model will over-predict with probability $q$ (and under-predict with probability $1-q$).
I find their phrasing to be particularly misleading. A regression does not yield the distribution of residuals. That’s a characteristic of the fitting process. The regression yields coefficients – or more generally, a model. Hopefully if you have the same issue I had, this post will help clarify things. ↩
The short version of this whole thing is that we can see the gradient boosting algorithm as just gradient descent on black-box models.
Typically gradient descent involves parameters and a closed form solution to the gradient of your learner as a way to update those parameters. In the case of trees, for instance, there are no parameters you want to update1, and somewhat more importantly they aren’t even continuous predictors, much less differentiable!
But gradient boosting can be seen as gradient descent on black box models resulting in an additive black box model.
A description of the algorithm I’m talking about can be found on wikipedia, but I’ll go over the algorithm somewhat quickly here just so we’re on the same page. The basic idea is that Gradient Boosting can be applied to any additive model (black-box or otherwise), and not just trees; although you will generally see this applied on trees. We won’t be assuming any particular form of model in this post, but we may occasionally add information specific to trees because of their ubiquity.
Note: an iteration in an additive tree based model is a new tree. So when we say things like train a new iteration, we mean train a new tree, if trees are your model.
At each step, you train a new iteration of the model where the target is the gradient of your loss with respect to the predictions of the previous iteration. You then ensemble those iterations based on the pre-defined learning rate and their losses.
Let’s assume we have a feature-set $X$, a response variable $y$, a loss function $L(\hat y, y)$, and a learning rate $\lambda\in\mathbb{R}_+$ such that $0 \lt \lambda \leq 1$. The algorithm is as follows:
First we define $y_0 = f_0(X) = 0$.
Now for $i \geq 1$ (and less than our stopping criteria2), we let:
\[\begin{align*} y_i =& -\frac{\partial{L(y, f_{i-1}(X))}}{\partial f_{i-1}(X)} \\ =& -\frac{\partial{L(y, y_{i-1})}}{\partial y_{i-1}} \end{align*}\]

The key here is that the form of $f$ is irrelevant. The target for the next iteration is the gradient of the loss with respect to the most recent predictions.
We then train a new iteration of the model such that:
\[\begin{align*} g_i(X) \sim& y_i \end{align*}\]

We then find the real number scalar that minimizes our loss in the following equation (AKA a line-search):

\[\begin{align*} \gamma_i =& \text{argmin}_\gamma L\left(y, f_{i-1}(X) + \gamma g_i(X)\right) \end{align*}\]

Finally we define:

\[\begin{align*} f_i(X) =& f_{i-1}(X) + \lambda\gamma_i g_i(X) \end{align*}\]

So that

\[\begin{align*} f(X) =& \sum\limits_{i=0}^N \lambda\gamma_i g_i(X) \end{align*}\]

In the special case where our loss function is mean squared error (or $L_2$), our gradients are just the residuals.
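To make the loop concrete, here’s a minimal sketch of that MSE special case (so the targets are plain residuals), using scikit-learn’s DecisionTreeRegressor as the black-box learner and a fixed learning rate in place of the line search – an illustration, not a reference implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_iters=50, learning_rate=0.1, max_depth=2):
    preds = np.zeros_like(y, dtype=float)            # f_0(X) = 0
    trees = []
    for _ in range(n_iters):
        residuals = y - preds                        # = -dL/df for (half) MSE
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(tree)
        preds += learning_rate * tree.predict(X)     # fixed step instead of a line search
    return trees, preds

X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel()
_, fitted = boost(X, y)
print(np.mean((y - fitted) ** 2))                    # should be small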
Given a model $f_i$, we want to construct $f_{i+1}$, and we’ll do that like so:
Given a differentiable model $f$ parameterized by parameter vector $\beta$, we want to find the $\hat\beta$ that minimizes the loss. We’re going to do that by following the gradient. More rigorously, given $X,y,\beta_i,\lambda$, we want to estimate $\beta_{i+1}$. The way we do that is:
This outline is exactly the same as the Gradient Boosting outline! The only difference is, instead of adding the models, we add the $\beta$s. This is, in fact, a superficial difference. A linear approximation of a gradient is the gradient itself, and the sum of linear models is the linear model of the sum4. In that light, traditional gradient descent is just a special case of gradient boosting!
Since we’re modeling the gradient with an arbitrary (potentially black-box) learner, we don’t have the option to find the gradient with respect to the parameters, so the scale might not decrease as desired. To exemplify this, let’s consider an $L_1$ objective (Mean-Absolute-Error), and a black-box learner. The gradient at each point is either 1, -1, or np.nan (because the absolute value function is $f(x) = \pm x$ depending on $x$). The magnitude of the gradients will never change5. In a linear model we have that extra $\frac{\partial f}{\partial \beta}$6 which adds scale to our gradient, but trees have no such thing. It’s for this reason we generally don’t have a magnitude-based stopping criterion, but rather opt for an explicit max-iterations.
One can also sub-sample (as is a parameter in popular packages like LightGBM). Sub-sampling is the black-box model version of the familiar Stochastic Gradient Descent.
Gradient Descent is essentially an optimization on Gradient Boosting, under the assumption that your function is parameterized by a continuous parameter vector, and that the function is differentiable with respect to that parameter vector. Under those assumptions, we can greatly simplify the computation and we get Gradient Descent.
Technically the split leaves in a tree define an indicator function on your data and the average value within a leaf (the prediction for that leaf) can be seen as the parameters of a tree, but this is kind of ridiculous because these are not tuned in the learning of the tree and there’s really no reason to do so (as far as I can tell). ↩
One generally explicitly sets a maximum number of estimators as the stopping criterion, but it doesn’t have to be so. One could imagine setting a criteria based on the magnitude of the gradients or something like that. ↩
In practice we generally don’t include the line search and just have a decreasing $\lambda$ – and sometimes we don’t even do that. We can get away with these shortcuts because the magnitude of the gradients will decrease as you get closer to the optima, and the derivative with respect to $\beta$ is always continuous. The same cannot be said about the gradient boosting algorithm. ↩
If we call our final linear model: $X\hat\beta$, then $X\hat\beta = X\left(\sum\limits_{i=0}^N\lambda\gamma_i\alpha_i\right) = \sum\limits_{i=0}^N\lambda\gamma_iX\alpha_i $. So we can see that linear regression has always been constructed as a sum. ↩
Neglecting the case where you get a perfect fit for a meaningful amount of data, the magnitude of the gradients will always be $|-1| = |1| = 1$. ↩
This comes from the chain rule when computing the derivative of the loss with respect to the parameters. ↩
Earlier today I asked a colleague what the formal definition of a Unit Factor Portfolio is. He told me “it’s a portfolio with unit exposure to the factor, and orthogonalized to all the others.” That didn’t make sense to me, so I’m going to dig in here.
We (quants) tend to use some form of a risk model to optimize our portfolios1 given a vector of alphas2. This post assumes that you know what a risk model is. Today we’ll be going over risk models from a geometric point of view.
A continuous, real valued, random variable3 $X$ can be seen as a vector in $\mathbb{R}^n$, where $n$ is the number of observations you have. So if I take 7 samples of a normally distributed random variable, it might look like this:
>>> import numpy as np
>>> rnd = np.random.RandomState(42)
>>> rnd.normal(size=(7,))
array([ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337,
-0.23413696, 1.57921282])
As you can see, this is just a vector in $\mathbb{R}^7$.
If you’ve spent time in a university’s Math department, you may have heard someone saying, “as long as we have an inner-product4 defined, we have a geometry”. What that means is: if you have an inner product defined, we can define distances and angles using that inner product. Here is how we can define a norm (measure of “size”), distance between two points, and an angle using only inner products (letting $\theta$ be the angle between two vectors $u,v$):
\[\begin{align} \left\| u\right\| =& \sqrt{\langle u, u\rangle} \\ d(x, y) =& \left\| x - y \right\| \\ \cos(\theta) =& \frac{\langle u, v\rangle}{\|u\|\cdot\|v\|} \end{align}\]

The covariance between two random variables, $X,Y \in \mathbb{R}^n$, is defined as:
\[\begin{align*} \text{cov}(X,Y) =& \sum\limits_{i=0}^{n-1}(X_i - \mu_X)(Y_i - \mu_Y) \\ =& \langle X-\mu_X, Y - \mu_Y\rangle \end{align*}\]

In fact, if we have a covariance matrix defined and we want to compute the covariance between two vectors, the formula changes to $\rho(u,v) = u'Vv$, which is still an inner product as long as $V$ is positive-definite5. In this light, the covariance matrix $V$ is seen as a linear isomorphism $V:\mathbb{R}^n\to\mathbb{R}^n$. Basically, it’s just reshaping the space to more accurately represent the covariance structure we see. $Vv$ is the transformation of $v$ into something that makes a bit more geometric sense than the coordinate structure we started with.
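As a quick numerical sketch of the “covariance is an inner product of demeaned vectors” claim (illustrative only; np.cov just adds a $1/(n-1)$ normalization):

>>> import numpy as np
>>> rnd = np.random.RandomState(0)
>>> x, y = rnd.normal(size=(2, 1000))
>>> xc, yc = x - x.mean(), y - y.mean()
>>> np.allclose(xc @ yc, np.cov(x, y)[0, 1] * (len(x) - 1))
True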
So what does the positive-definite part mean geometrically? Algebraically, it means that all the eigenvalues are positive, but what does that mean? It means that if you look at the vectors before and after transforming them, none of them has completely flipped. This is good, because if one did manage to flip, we’d find that a vector was negatively correlated with itself!
Since $\sigma_X = \sqrt{\text{cov}(X,X)}$, we already know that the standard deviation of $X$ is just the norm of the centered $X$ (that is, of $X - \mu_X$). So standard deviation is a norm. The two are equivalent. Sweet. Let’s move on.
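Before moving on, here’s a quick numerical sanity check of these claims on made-up data (a sketch; the last assertion previews the correlation-as-cosine identity discussed next):

import numpy as np

rnd = np.random.RandomState(0)
x = rnd.normal(size=100)
y = 0.5 * x + rnd.normal(size=100)

# center both variables
cx, cy = x - x.mean(), y - y.mean()

# covariance as an inner product of the centered vectors (np.cov divides by n - 1)
assert np.isclose(cx @ cy / (len(x) - 1), np.cov(x, y)[0, 1])

# standard deviation as the norm of the centered vector
assert np.isclose(np.linalg.norm(cx) / np.sqrt(len(x) - 1), x.std(ddof=1))

# correlation as the cosine of the angle between the centered vectors
assert np.isclose(cx @ cy / (np.linalg.norm(cx) * np.linalg.norm(cy)),
                  np.corrcoef(x, y)[0, 1])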
The geometric interpretation of the correlation between two random variables, $X,Y$ is the cosine of the angle between them.
\[\begin{align*} \rho(X,Y) =& \frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y} \\ =& \frac{\langle X,Y\rangle}{\|X\|\cdot \|Y\|} \\ =& \cos(\theta) \end{align*}\]

When we talk about how risky a portfolio is, intuitively you probably understand that as “what’s the probability that I lose all of my money”, and you’re not wrong. That would be what we call downside risk. Unfortunately, that’s not so easy to compute (at least not on paper), and there isn’t much in the way of mathematical research built up on the notion of downside risk. Maybe in a later post, I’ll go over how downside risk relates to the Information Ratio, but not today. There is, however, a lot of research built up on standard deviation, and that’s somewhat related (“how irregular are your returns” is not too bad of a proxy). So we define a portfolio’s risk as the standard deviation of the portfolio’s realized alpha (returns above the benchmark).6
Now comes the wonderful task of predicting future risk. This involves estimating a covariance matrix of asset-level returns. But if you’re in the equities world (and even more so if you’re in the quantitative equities world), the covariance matrix is very large (often on the order of 5,000 by 5,000) and non-stationary7. This means we need a lot of data to fit a good covariance matrix, which means it’s slow to update and will inherently leave out newer assets. Clearly there is a place for a lower dimensional estimator of risk. We call such an estimator a risk model. There are several versions of a risk model, but this is meant to be a short post, so we’ll only go over Barra style factor models.
A factor risk model takes the following form:
\[(Xh)'V(Xh) + D h\]

Where $h \in \mathbb{R}^n$ is our holdings (of $n$ many assets) vector, $X \in \mathbb{R}^{m\times n}$ is called our factor exposure matrix (usually shortened to simply, “exposure matrix”), $V \in \mathbb{R}^{m\times m}$ is called our factor covariance matrix, and $D \in \mathbb{R}^{n}$ is the specific risk vector. The exposure matrix is computed using endogenous data (as opposed to modeled data) like the stock’s N-day momentum, the stock’s market cap, number of employees, industry, etc.
If $m \ll n$, the covariance matrix can be computed more efficiently than the full $n\times n$ covariance matrix. This is not a silver bullet, though, as we’ve just segmented our problem into two parts: creating factor exposures and estimating factor risk. There are other types of risk models (such as statistical, shrunken, etc.), but generally they involve the same decomposition: a lower dimensional embedding, then covariance in that lower dimensional space8.
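To make the shapes concrete, here’s a minimal sketch of evaluating that formula on made-up data. Everything below ($X$, $V$, $D$, $h$, and the dimensions) is fabricated for illustration, and the specific-risk term is written as the linear $Dh$ from the formula above (many practitioners use a quadratic $\sum_i D_i^2 h_i^2$ instead).

import numpy as np

rnd = np.random.RandomState(0)
n_assets, n_factors = 50, 5                    # n and m in the text, with m << n

X = rnd.normal(size=(n_factors, n_assets))     # factor exposures, m x n
B = rnd.normal(size=(n_factors, n_factors))
V = B @ B.T + 1e-3 * np.eye(n_factors)         # factor covariance, m x m, positive definite
D = rnd.uniform(0.01, 0.05, size=n_assets)     # specific risk, one entry per asset
h = rnd.normal(size=n_assets)
h /= np.abs(h).sum()                           # normalize to unit gross exposure

f = X @ h                                      # the portfolio's factor exposures, Xh
factor_term = f @ V @ f                        # (Xh)' V (Xh)
specific_term = D @ h                          # D h, as written above
risk = factor_term + specific_term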
We can look at this style of risk model in a geometric way too. Let’s focus entirely on the first part ($(Xh)'V(Xh)$) for now, and furthermore let’s assume that we’ve fit the model already – meaning we’ve already computed $X$ and $V$ – the factor exposures and covariances.
The first step is to embed $h$ into the lower $m$ dimensional space via $X$. This means the factor model is not – in fact, cannot be – a real covariance matrix. Its rank is at most $m \lt n$ and hence it’s not positive definite. Given that we’re measuring covariance in factor space, the efficacy of the risk model rests heavily on how representative the factor exposures are.
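Continuing the made-up sketch above (so this reuses X, V, and n_factors), we can check that rank claim directly:

# X' V X is n x n but has rank at most m, so on its own it cannot be a
# full covariance matrix for n > m assets
assert np.linalg.matrix_rank(X.T @ V @ X) == n_factors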
One common task is to attribute some amount of the total risk to a particular factor. By that I mean that we have a portfolio and we would like to express the total risk of this portfolio in terms of the risk factors from our risk model. The theory suggests that we want the marginal contribution to risk (MCTR). One way of thinking about this is if we have our factor set $F$, then we would like to first express the risk budget as the idiosyncratic risk plus a linear combination of the factor risks, then the MCTR for a given factor is that factor’s coefficient.
\[\sigma = \sigma_I + \sum\limits_{f\in F}\sigma_ff\]

But first, let’s go over some mathematical prerequisites.
Let $u,v\in\mathbb{R}^n$. Given an inner product $\langle\cdot,\cdot\rangle$ (and the associated norm), the vector projection of $u$ onto $v$ (denoted as $\vec{\text{proj}}_vu$) is defined as:
\[\vec{\text{proj}}_vu := \frac{\langle u, v\rangle}{\|v\|} \frac{v}{\|v\|}\]

Another way to see the definition, given our definition of $\cos$, is:
\[\begin{align*} \vec{\text{proj}}_vu =& \frac{\langle u, v\rangle}{\|v\|} \frac{v}{\|v\|} \\ =& \frac{\|u\|\langle u, v\rangle}{\|u\|\cdot\|v\|} \frac{v}{\|v\|} \\ =& \|u\|\cos(\theta) \frac{v}{\|v\|} \end{align*}\]

Geometrically, this gives us the component of $u$ in the $v$ direction. See figure below.
The scalar projection of $u$ onto $v$ is the coefficient of the vector projection and is denoted as (without the little vector hat):
\[\text{proj}_vu := \frac{\langle u, v\rangle}{\|v\|}\]

The typical way to define marginal contribution to risk ($MCTR$) is as the partial derivative of the total risk with respect to our factor in question.
\[\begin{align*} MCTR(i) :=& \frac{\partial \sigma}{\partial f_i} \end{align*}\]

But if we want to look at this geometrically, we’d ask: what’s the coefficient of this basis vector (factor)? Meaning, for this $\sigma$, what’s $\sigma_f$ in the following formula?
\[\sigma = \sigma_I + \sum\limits_{f\in F}\sigma_ff\]

The typical way to do this is via projections. As you may recall from your Linear Algebra course, you find what the projection of $\sigma$ onto a given vector $f_i$ is, and that’s the coefficient you’re looking for. So now we just have to show that this version leads us to the old definition.
\[\begin{align*} \text{proj}_ff_i =& \frac{\langle f, f_i\rangle}{\left\|f\right\|} \\ =& \frac{f'Vf_i}{\|f\|} \\ =& \frac{(f'V + f'V')f_i}{2\|f\|} \\ =& \frac{1}{2\|f\|}\cdot \frac{\partial f'Vf}{\partial f_i} \\ =& \frac{1}{2\sqrt{\langle f, f\rangle}}\cdot \frac{\partial \langle f, f\rangle}{\partial f_i} \\ =& \frac{\partial \sqrt{\langle f, f\rangle}}{\partial f_i} \\ =& \frac{\partial \sigma}{\partial f_i} \\ =& MCTR(i) \end{align*}\]

So it very literally finds the amount of $\sigma$ that is explained by $f_i$. And it comes with all the geometric intuition we all know and love.
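Here’s a quick numerical check (a sketch on made-up factor data) that the projection form and the derivative form agree:

import numpy as np

rnd = np.random.RandomState(1)
m = 5                                     # number of factors (made up)
B = rnd.normal(size=(m, m))
V = B @ B.T + 1e-3 * np.eye(m)            # a positive-definite factor covariance
f = rnd.normal(size=m)                    # the portfolio's factor exposures, Xh

sigma = np.sqrt(f @ V @ f)                # factor risk, ignoring specific risk

# projection form: <f, f_i> / ||f||, with <a, b> := a' V b
mctr_proj = (V @ f) / sigma

# derivative form: bump each factor exposure a little and difference the risk
eps = 1e-6
mctr_grad = np.array([
    (np.sqrt((f + eps * np.eye(m)[i]) @ V @ (f + eps * np.eye(m)[i])) - sigma) / eps
    for i in range(m)
])

assert np.allclose(mctr_proj, mctr_grad, rtol=1e-3, atol=1e-4)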
So how do we interpret specific risk (what we called $D$) in the factor models? Can we simply add a column to our covariance matrix for specific risk – even if it is mostly zeros? It turns out, no.
One way to see the difference is that the covariance part is a bilinear form, whereas the specific risk is linear. But let’s do a more qualitative investigation (it’s pretty rare for me to look specifically for a qualitative approach, but hey, every now and then…).
We have a surjective map $X:\mathbb{R}^n\to\mathbb{R}^m$ where $m\lt n$, so we know that the dimension of the kernel (the elements that go to 0) is $n - m$. So there is an $m$ dimensional subspace of holdings space that matters – as far as factors are concerned – and $n-m$ dimensions that don’t. We project the parts that do matter into factor space and compute the magnitude of risk there, where we know the geometry, and ignore the $n-m$ dimensions that we haven’t modeled with these factors.
If we left it at that, we would have two main problems:
So we add a vector of asset-level offsets, and call it specific risk.
Basically, a risk model is a quadratic form, and the specific risk is the linear term.
A Unit Factor Portfolio, sometimes called a Factor Mimicking Portfolio, is the characteristic portfolio with unit exposure to a given factor, and zero exposure to all others9. But what does that mean? Are they correlated? Are the returns correlated?
Given a factor embedding $X$, factor covariance matrix $V$, factor $f_i$, and the associated indicator variable $\mathbb{I}_i$, we’ll get as close as we can to computing a UFP. A UFP is the result of the following Lagrangian optimization:
\[UFP_i = \text{argmin}_{h,\lambda}\left( (Xh)'V(Xh) + D h + \lambda (Xh - \mathbb{I}_i)\right)\]

Because we know $Xh = \mathbb{I}_i$, we can reduce this down to:
\[\begin{align*} UFP_i =& \text{argmin}\left(\sigma_i + D h\right) \\ =& \text{argmin}\left(D h\right) \end{align*}\]

Subject to the restriction that $Xh = \mathbb{I}_i$. Since the kernel of $X$ is non-trivial, we know it doesn’t have a left inverse, so we can’t just left-multiply this whole thing away.
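The optimization can still be carried out in closed form, though. Below is a sketch on made-up data; note that, unlike the objective above, it uses the conventional quadratic specific-risk term $h'\,\text{diag}(D^2)\,h$ (with a purely linear $Dh$ term and a non-trivial kernel, the minimization would generally be unbounded), which makes the minimizer unique and gives the standard characteristic-portfolio solution.

import numpy as np

rnd = np.random.RandomState(2)
n_assets, n_factors = 50, 5

X = rnd.normal(size=(n_factors, n_assets))     # exposures, m x n
B = rnd.normal(size=(n_factors, n_factors))
V = B @ B.T + 1e-3 * np.eye(n_factors)         # factor covariance
D = rnd.uniform(0.01, 0.05, size=n_assets)     # specific risks

# asset-level covariance implied by the model (with quadratic specific risk)
Sigma = X.T @ V @ X + np.diag(D ** 2)

# min_h h' Sigma h  subject to  X h = e_i  has the standard closed form below;
# the columns of UFP are the unit factor portfolios, one per factor
Sigma_inv = np.linalg.inv(Sigma)
UFP = Sigma_inv @ X.T @ np.linalg.inv(X @ Sigma_inv @ X.T)

# each column has unit exposure to its own factor and zero exposure to the rest
assert np.allclose(X @ UFP, np.eye(n_factors), atol=1e-6)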
So are the holdings from different UFPs correlated? Almost certainly, but not necessarily predictably so.
Are their returns correlated? If the risk model is any good, neglecting specific risk, the correlation between the returns of $UFP_i$ and $UFP_j$ should be close to the correlation implied by $V_{i,j}$ – that is, $V_{i,j}/\sqrt{V_{i,i}V_{j,j}}$, the correlation between the two factors.
The purpose of this post was to work through some ideas on UFPs, and I think we did that.
See Active Portfolio Management by Richard Grinold and Ron Kahn for more information on this optimization. ↩
If you let $y$ be the vector of per-period returns of your fund and let $X$ be the vector of benchmark returns (say Market Cap Weighted S&P 500), then the $y = X \beta + \alpha$ regression yields the amount your fund outperformed the benchmark ($\alpha$) and your fund’s sensitivity to it ($\beta$). See my post on linear regression for more info. ↩
There are other types of random variables, but we won’t be going into those for this post. ↩
If you’re unfamiliar with the term “inner-product”, you can think of it as a more general version of a dot-product or see math-world’s definition. ↩
There is an additional requirement that $V$ be Hermitian, but $V$ is real valued, so it’s necessarily Hermitian. ↩
Perhaps a better solution would be to consider the standard deviation of the residual of the predicted returns against the realized returns. In practice, the predicted returns tend to be at least an order of magnitude smaller than realized returns, so this is not likely to be immediately fruitful. ↩
A non-stationary distribution is one that changes over time. A lot of non-stationarity means your data from long ago is of an essentially different distribution and is not particularly helpful for estimating covariances. ↩
This is not really true with the shrunk (Ledoit-Wolf) covariance matrix, but the same effect is achieved by “shrinking” the covariance matrix to a diagonal one. For more information on that, see Honey, I Shrunk the Sample Covariance Matrix. ↩
For more information, see the following bloomberg article. ↩
Firstly, this post builds off of my previous post, so if you’re learning these things for the first time and find yourself kinda lost, start there.
In addition to the previous post, the following are mathematical definitions you should probably know before reading this post. If you don’t know any of these, trust google.
Definition:
Given a path $\alpha$ from $x_0$ to $x_1$ in $X$, we define $\hat\alpha:\pi_1(X,x_0)\to\pi_1(X,x_1)$ as follows:
\[\hat\alpha([f]) = [\bar\alpha]\star[f]\star[\alpha]\]

And it’s a theorem that $\hat\alpha$ is a group isomorphism.
Definition:
Given a continuous map $h:(X,x_0)\to(Y,y_0)$, we define:
$h_\star:\pi_1(X,x_0)\to\pi_1(Y,y_0)$ as follows:
\[h_\star([f]) = [h\circ f]\]

And with that…
Let’s get into the questions!
Show that if $A\subseteq\mathbb{R}^n$ is star convex then $A$ is simply connected.
Let $a\in A$ be one of the points that make the set star convex. Our task now is to show:

1. $A$ is path connected, and
2. every loop in $A$ based at $a$ is null-homotopic (so $\pi_1(A,a)$ is trivial).
For the first: just let $x,y\in A$ and let $f_x,f_y$ be the paths connecting them to $a$. Then note that $f_x \star \bar f_y$ is a path connecting $x$ to $y$.
For the second we can use the same straight-line homotopy that we used for the convex version: let $f$ be a path starting and ending at $a$ and let $H(s,t) = t\cdot a + (1-t)\cdot f(s)$.
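For completeness (this check isn’t spelled out in the original), $H$ really is a path homotopy that stays inside $A$: star convexity puts the whole segment from $f(s)\in A$ to $a$ inside $A$, and

\[\begin{align*} H(s,0) =& f(s), \qquad H(s,1) = a, \\ H(0,t) =& t\cdot a + (1-t)\cdot f(0) = a = H(1,t). \end{align*}\]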
Lastly, because part a (that I didn’t mention) is to find a set that’s star convex but not convex, I’ll leave the question with The Star of David.
Show that if $\alpha$ is a path from $x_0$ to $x_1$ and $\beta$ is a path from $x_1$ to $x_2$ (both in $X$), and $\gamma = \alpha\star\beta$, then $\hat\gamma = \hat\beta\circ\hat\alpha$.
Since these are functions between equivalence classes, our job is now to show that the outputs for a given input are homotopic.
So let $f$ be a path in $X$ starting and stopping at $x_0$. Then:
\[\begin{align*} \hat\gamma(f) =& [\bar\gamma]\star[f]\star[\gamma] \\ =& [\overline{\alpha\star\beta}]\star[f]\star[\alpha\star\beta]\\ =& [\bar\beta\star\bar\alpha]\star[f]\star[\alpha\star\beta]\\ =& [\bar\beta]\star[\bar\alpha]\star[f]\star[\alpha]\star[\beta]\\ =& [\bar\beta]\star([\bar\alpha]\star[f]\star[\alpha])\star[\beta]\\ =& [\bar\beta]\star(\hat\alpha(f))\star[\beta]\\ =& \hat\beta(\hat\alpha(f)) \end{align*}\]

Show that if $x_0,x_1$ are points in a path-connected space $X$, $\pi_1(X,x_0)$ is abelian if and only if for every pair of paths $\alpha,\beta$ from $x_0$ to $x_1$, $\hat\alpha = \hat\beta$.
If $\pi_1(X,x_0)$ is abelian, we have:
\[\begin{align*} \hat\alpha([f]) =& [\bar\alpha]\star[f]\star[\alpha]\\ =& [\bar\alpha]\star[\overline{\beta\star\bar\alpha} \star(\beta\star\bar\alpha)\star f]\star[\alpha]\\ =& [\bar\alpha]\star[(\alpha\star\bar\beta)\star f\star(\beta\star\bar\alpha)]\star[\alpha]\\ =& [\bar\alpha]\star[\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha]\star[\alpha]\\ =& [\bar\alpha\star\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha\star\alpha]\\ =&[\bar\beta]\star[f]\star[\beta] \\ =&\hat\beta([f]) \end{align*}\]

For the converse, suppose $\hat\alpha = \hat\beta$ for every pair of paths from $x_0$ to $x_1$, and let $f,g$ be loops at $x_0$. Pick any path $\alpha$ from $x_0$ to $x_1$ (path connectedness) and let $\beta = g\star\alpha$. Then $\hat\alpha([f]) = \hat\beta([f])$ gives

\[[\bar\alpha]\star[f]\star[\alpha] = [\bar\alpha]\star[\bar g]\star[f]\star[g]\star[\alpha],\]

and cancelling on both sides leaves $[f] = [\bar g]\star[f]\star[g]$, i.e. $[g]\star[f] = [f]\star[g]$.

Let $A\subset X$ and let $r:X\to A$ be a retraction. Show that for $a_0\in A$, $r_\star:\pi_1(X,a_0)\to\pi_1(A,a_0)$ is surjective.
Well, any path $\alpha$ in $A$ starting and stopping at $a_0$ will also be a path in $X$ (because $A\subset X$), and since $r$ restricted to $A$ is the identity, $r_\star([\alpha]_X) = [r\circ\alpha]_A = [\alpha]_A$. So every class in $\pi_1(A,a_0)$ gets hit, and $r_\star$ is surjective.
Let $A\subset\mathbb{R}^n$ and $h:(A,a_0)\to(Y,y_0)$. Show that if $h$ is extendable to a continuous map $\tilde h:\mathbb{R}^n\to Y$, then $h_\star$ is trivial (i.e. sends everybody to the class of the constant loop).
Let $G=\pi_1(A,a_0), H=\pi_1(Y,y_0)$. Then, as a reminder, $h_\star:G\to H$ such that $h_\star([\alpha]) = [h\circ\alpha]$.
So let $\alpha,\beta$ be loops in $(A,a_0)$. Since $\alpha,\beta$ are arbitrary, it is sufficient to show that $h\circ\alpha$ and $h\circ\beta$ are homotopic – taking $\beta$ to be the constant loop then shows every image is the class of the constant loop.
Consider $F:I\times I\to \mathbb{R}^n$ given by $F(s,t) = t\alpha(s) + (1-t)\beta(s)$ (the straight-line homotopy between the two loops). Even though $F$ may leave $A$ (into some other part of $\mathbb{R}^n$), $h$ is extendible to $\tilde h:\mathbb{R}^n\to Y$, so $\tilde h\circ F$ is a homotopy between $h\circ\alpha$ and $h\circ\beta$ that stays entirely in $Y$. Hence $[h\circ \alpha] = [h\circ \beta]$ (note $\tilde h$ agrees with $h$ on $A$).
Let $X$ be path connected, $h:X\to Y$ be continuous with $h(x_0) = y_0$ and $h(x_1)=y_1$; let $\alpha$ be a path in $X$ from $x_0$ to $x_1$, and $\beta = h\circ\alpha$. Show that
\[\hat\beta\circ(h_{x_0})_\star = (h_{x_1})_\star\circ\hat\alpha\]

As in, show that the following diagram commutes:
\[\newcommand{\ra}[1]{\xrightarrow{\quad#1\quad}} \newcommand{\da}[1]{\left\downarrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.} \newcommand{\sea}[1]{\left\searrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.}\]

\[\begin{array}{ccc} \pi_1(X,x_0) & \ra{(h_{x_0})_\star} & \pi_1(Y,y_0) \\ \da{\hat\alpha} & & \da{\hat\beta} \\ \pi_1(X,x_1) & \ra{(h_{x_1})_\star} & \pi_1(Y,y_1) \\ \end{array}\]

Let $f=\hat\beta\circ (h_{x_0})_\star$ and $g=(h_{x_1})_\star\circ\hat\alpha$. The claim is now that
\[\forall a \in \pi_1(X,x_0),\quad f(a) = g(a)\]

Proof:
Let $a = [c]$ and let $b_0 \in f(a) = \hat\beta\circ(h_{x_0})_\star ([c])$. That means $b_0$ is a particular path in $(Y,y_1)$ such that $b_0 = \bar\beta\star(h\circ c)\star\beta$, whereas $g(a) \ni b_1 = h\circ(\bar\alpha\star d \star\alpha)$ with $c,d$ homotopic.
Let $\alpha$ be a path in $(X,x_0)$ and $\beta = h\circ\alpha$ – hence a path in $(Y,y_1)$.