Skip to main content

AI Basic Notes

Linear Algebra

Linear Space

向量空间的一组基: 张成 (span) 该空间的一个线性无关 (linearly independent) 向量集.

Linear Transformation

线性变换是指一个向量空间到另一个向量空间的映射, 满足加法和数乘运算的线性性质:

L(αv+βw)=αL(v)+βL(w)\begin{equation} L(\alpha\vec{v}+\beta\vec{w})=\alpha{L(\vec{v})}+\beta{L(\vec{w})} \end{equation}

Matrix representation:

v=xi+yj=[xy]Av=xi^+yj^=x[ac]i+y[bd]j=[abcd][xy]A2A1=A2i^+A2j^=([a2b2c2d2][a1c1])i+([a2b2c2d2][b1d1])j=(a1[a2c2]+c1[b2d2])i+(b1[a2c2]+d1[b2d2])j=[a2a1+b2c1a2b1+b2d1c2a1+d2c1c2b1+d2d1]\begin{split} \vec{v}&=x\vec{i}+y\vec{j} \\ &=\begin{bmatrix}x \\ y\end{bmatrix} \\ A\vec{v}&=x\hat{i}+y\hat{j} \\ &=x\begin{bmatrix}a \\ c\end{bmatrix}\vec{i} +y\begin{bmatrix}b \\ d\end{bmatrix}\vec{j} \\ &=\begin{bmatrix}a & b \\ c & d\end{bmatrix} \begin{bmatrix}x \\ y\end{bmatrix} \\ A_2A_1&=A_2\hat{i}+A_2\hat{j} \\ &=(\begin{bmatrix}a_2 & b_2 \\ c_2 & d_2\end{bmatrix} \begin{bmatrix}a_1 \\ c_1 \end{bmatrix})\vec{i} +(\begin{bmatrix}a_2 & b_2 \\ c_2 & d_2\end{bmatrix} \begin{bmatrix}b_1 \\ d_1\end{bmatrix})\vec{j} \\ &=(a_1\begin{bmatrix}a_2 \\ c_2\end{bmatrix} +c_1\begin{bmatrix}b_2 \\ d_2\end{bmatrix})\vec{i} +(b_1\begin{bmatrix}a_2 \\ c_2\end{bmatrix} +d_1\begin{bmatrix}b_2 \\ d_2\end{bmatrix})\vec{j} \\ &=\begin{bmatrix} a_2a_1+b_2c_1 & a_2b_1+b_2d_1 \\ c_2a_1+d_2c_1 & c_2b_1+d_2d_1 \end{bmatrix} \end{split}
Matrix Multiplication

左乘矩阵相当于对列向量进行线性变换, 右乘矩阵相当于对行向量进行线性变换.

Am×nA_{m\times n} 表示 n 维空间到 m 维空间的线性变换:

  • n 列: 输入空间有 n 个基向量, 即为 n 维空间.
  • m 行: 输出空间每个基向量对应 m 个坐标, 即为 m 维空间.
  • A1×nA_{1\times n} 表示 n 维空间到一维空间的线性变换: 向量点乘 (Dot Product) vw\vec{v} \cdot \vec{w} 可以理解为 w\vec{w} 通过 V1×nV_{1\times n} 变换到一维空间后的投影.
Dot Product and Cross Product
  • Dot Product: vw=vwcosθ\vec{v} \cdot \vec{w}=\|\vec{v}\|\|\vec{w}\|\cos{\theta}.
  • Cross Product: v×w=vwsinθ\|\vec{v} \times \vec{w}\|=\|\vec{v}\|\|\vec{w}\|\sin{\theta}.
  • v(v×w)=0\vec{v}\cdot(\vec{v}\times\vec{w})=0, w(v×w)=0\vec{w}\cdot(\vec{v}\times\vec{w})=0.

Basis changes, translating transformations:

vp=P1APwp\vec{v_p}=P^{-1}AP\vec{w_p}

Determinant

det(A)\det(A) 表示矩阵 A 的行列式, 表示该变换对空间的缩放因子:

Determinant

det(A)=0\det(A)=0 时, 表示该变换将空间压缩到一个低维空间, 称矩阵 AA 为奇异矩阵 (Singular Matrix):

  • 矩阵 AA 列向量线性相关.
  • 矩阵 AA 不满秩 (Not full rank).
  • 矩阵 AA 不可逆.

Determinant for 2d matrix:

abcd=adbc\begin{equation} \begin{vmatrix}a & b \\ c & d\end{vmatrix}=ad-bc \end{equation}

Determinant Diagram

Determinant for 3d matrix:

abcdefghi=aefhibdfgi+cdegh\begin{equation} \begin{vmatrix}a & b & c \\ d & e & f \\ g & h & i\end{vmatrix} =a\begin{vmatrix}e & f \\ h & i\end{vmatrix} -b\begin{vmatrix}d & f \\ g & i\end{vmatrix} +c\begin{vmatrix}d & e \\ g & h\end{vmatrix} \end{equation}

Determinant for matrix multiplication:

det(A1A2)=det(A1)det(A2)\begin{equation} \det(A_1A_2)=\det(A_1)\det(A_2) \end{equation}

Gaussian Elimination

高斯消元法求解线性方程组 (Linear System Of Equations):

首先第一行的第一个元素化为 1, 下面每行减去第一行乘以该行第一个元素的倍数, 从而把第一列除第一行外的全部元素都化为 0, 进而把第二列除前两个元素之外的元素都化为 0, 最后把矩阵化为上三角矩阵. 类似地, 从最后一行开始, 逐行把上三角矩阵化为单位矩阵.

Ax=vA1Ax=A1vx=A1v\begin{split} A\vec{x}&=\vec{v} \\ A^{-1}A\vec{x}&=A^{-1}\vec{v} \\ \vec{x}&=A^{-1}\vec{v} \end{split}

Eigenvalue and Eigenvector

A=[abcd]A=\begin{bmatrix} a & b \\ c & d \end{bmatrix} eigenvalue Av=λvA\vec{v}=\lambda\vec{v} quick calculation:

λ=m±m2p=λ1+λ22±(λ1+λ22)2λ1λ2=a+d2±(a+d2)2(adbc)\begin{split} \lambda&=m\pm\sqrt{m^2-p} \\ &=\frac{\lambda_1+\lambda_2}{2} \pm\sqrt{(\frac{\lambda_1+\lambda_2}{2})^2-\lambda_1\lambda_2} \\ &=\frac{a+d}{2}\pm\sqrt{(\frac{a+d}{2})^2-(ad-bc)} \end{split}

Linear Algebra Reference

Mathematical Analysis

Limit

洛必达法则是求解分数形式的未定型极限 limxa00\lim_{x\to{a}}\frac{0}{0} 的有效方法之一:

limxaf(x)g(x)=limxadf(x)dg(x)=limxadfdx(a)dxdgdx(a)dx=limxadfdx(a)dgdx(a)=limxaf(a)g(a)\begin{equation} \begin{split} \lim_{x\to{a}}\frac{f(x)}{g(x)} &=\lim_{x\to{a}}\frac{df(x)}{dg(x)} \\ &=\lim_{x\to{a}}\frac{\frac{df}{dx}(a)dx}{\frac{dg}{dx}(a)dx} \\ &=\lim_{x\to{a}}\frac{\frac{df}{dx}(a)}{\frac{dg}{dx}(a)} \\ &=\lim_{x\to{a}}\frac{f'(a)}{g'(a)} \end{split} \end{equation}

Derivative

常见导数:

ddxxn=nxn1ddxsinx=cosxddxcosx=sinxddxax=axlnaddxex=exddxlogax=1xlnaddxlnx=1xddx(g(x)+h(x))=g(x)+h(x)ddx(g(x)h(x))=g(x)h(x)+g(x)h(x)ddxf(g(x))=f(g(x))g(x)ddxf1(x)=1f(f1(x))ddxa(x)b(x)f(t)dt=f(b(x))b(x)f(a(x))a(x)\begin{equation} \begin{split} \frac{d}{dx}x^n&=nx^{n-1} \\ \frac{d}{dx}\sin{x}&=\cos{x} \\ \frac{d}{dx}\cos{x}&=-\sin{x} \\ \frac{d}{dx}a^x&=a^x\ln{a} \\ \frac{d}{dx}e^x&=e^x \\ \frac{d}{dx}\log_a{x}&=\frac{1}{x\ln{a}} \\ \frac{d}{dx}\ln{x}&=\frac{1}{x} \\ \frac{d}{dx}(g(x)+h(x))&=g'(x)+h'(x) \\ \frac{d}{dx}(g(x)h(x))&=g'(x)h(x)+g(x)h'(x) \\ \frac{d}{dx}f(g(x))&=f'(g(x))g'(x) \\ \frac{d}{dx}f^{-1}(x)&=\frac{1}{f'(f^{-1}(x))} \\ \frac{d}{dx}\int_{a(x)}^{b(x)}f(t)dt&=f(b(x))b'(x)-f(a(x))a'(x) \end{split} \end{equation}

Series

泰勒级数利用函数在某点的各阶导数, 近似该点附近函数的值:

11x=n=0xnx<1ex=n=0xnn!ln(1+x)=n=1(1)n1nxnx(1,1]sin(x)=n=0(1)n(2n+1)!x2n+1cos(x)=n=0(1)n(2n)!x2nf(x)=n=0f(n)(x0)n!(xx0)n=f(x0)+f(x0)(xx0)+f(x0)2!(xx0)2+\begin{equation} \begin{split} \frac{1}{1-x}&=\sum\limits_{n=0}^{\infty}x^n \quad |x|\lt1 \\ e^x&=\sum\limits_{n=0}^{\infty}\frac{x^n}{n!} \\ \ln(1+x)&=\sum\limits_{n=1}^{\infty}\frac{(-1)^{n-1}}{n}x^n \quad x\in(-1,1] \\ \sin(x)&=\sum\limits_{n=0}^{\infty}\frac{(-1)^n}{(2n+1)!}x^{2n+1} \\ \cos(x)&=\sum\limits_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}x^{2n} \\ f(x)&=\sum\limits_{n=0}^{\infty}\frac{f^{(n)(x_0)}}{n!}(x-x_0)^n \\ &=f(x_0)+f'(x_0)(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+\dots \end{split} \end{equation}

Euler's Formula

复数平面 (Complex Plane) 上的圆周运动:

eix=cosx+isinx\begin{equation} e^{ix}=\cos{x}+i\sin{x} \end{equation}

Fourier Transform

Time to frequency transform:

f^(ξ)=f(t)e2πiξtdt\begin{equation} \hat{f}(\xi)=\int_{-\infty}^{\infty}f(t)e^{-2\pi i\xi t}dt \end{equation}

Discrete Fourier Transform (DFT):

X[k]=n=0N1xnei2πNkn\begin{equation} X[k]=\sum\limits_{n=0}^{N-1}x_n e^{-\frac{i2\pi}{N}kn} \end{equation}

outcomes

[11111e2πine2πi(2)ne2πi(n1)n1e2πi(2)ne2πi(4)ne2πi(2)(n1)n1e2πi(n1)ne2πi(2)(n1)ne2πi(n1)(n1)n]\begin{bmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & e^{\frac{2\pi i}{n}} & e^{\frac{2\pi i(2)}{n}} & \dots & e^{\frac{2\pi i(n-1)}{n}} \\ 1 & e^{\frac{2\pi i(2)}{n}} & e^{\frac{2\pi i(4)}{n}} & \dots & e^{\frac{2\pi i(2)(n-1)}{n}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & e^{\frac{2\pi i(n-1)}{n}} & e^{\frac{2\pi i(2)(n-1)}{n}} & \dots & e^{\frac{2\pi i(n-1)(n-1)}{n}} \end{bmatrix}

Fourier Transform

Differential Equation

微分方程 (Differential Equation) 是描述变量之间关系的方程, 通常包含未知函数及其导数, 用于描述物理现象和自然规律.

First Order Differential Equation

一阶微分方程:

ddt[x(t)y(t)]=[abcd][x(t)y(t)][x(t)y(t)]=e[abcd]t[x(0)y(0)]\begin{equation} \frac{d}{dt}\begin{bmatrix}x(t)\\y(t)\end{bmatrix} =\begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}x(t)\\y(t)\end{bmatrix} \Rightarrow \begin{bmatrix}x(t)\\y(t)\end{bmatrix} =e^{\begin{bmatrix}a&b\\c&d\end{bmatrix}t}\begin{bmatrix}x(0)\\y(0)\end{bmatrix} \end{equation} ifv(t)=eMtv0thenddtv(t)=ddteMtv0=ddtn=0Mnn!tnv0=n=0Mnn!ntn1v0=Mn=0Mn1(n1)!tn1v0=MeMtv0=Mv(t)\begin{split} \text{if} \quad \vec{v}(t)&=e^{Mt}\vec{v}_0 \\ \text{then} \quad \frac{d}{dt}\vec{v}(t) &=\frac{d}{dt}e^{Mt}\vec{v}_0 \\ &=\frac{d}{dt}\sum\limits_{n=0}^{\infty}\frac{M^n}{n!}t^n\vec{v}_0 \\ &=\sum\limits_{n=0}^{\infty}\frac{M^n}{n!}nt^{n-1}\vec{v}_0 \\ &=M\sum\limits_{n=0}^{\infty}\frac{M^{n-1}}{(n-1)!}t^{n-1}\vec{v}_0 \\ &=Me^{Mt}\vec{v}_0 \\ &=M\vec{v}(t) \end{split}

Second Order Differential Equation

x¨(t)=μx˙(t)ωx(t)\begin{equation} \ddot{x}(t)=-\mu\dot{x}(t)-\omega x(t) \end{equation}

Gravitational force equation:

y¨(t)=g,y˙(t)=gt+v0dx1dt=v1,dv1dt=Gm2(x2x1x2x1)(1x2x12)θ¨(t)=μθ˙(t)gLsin(θ(t))\begin{split} \ddot{y}(t)=-g, \quad & \dot{y}(t)=-gt+v_0 \\ \frac{d\vec{x}_1}{dt}=\vec{v}_1, \quad & \frac{d\vec{v}_1}{dt}=Gm_2\Big(\frac{\vec{x}_2-\vec{x}_1}{\|\vec{x}_2-\vec{x}_1\|}\Big)\Big(\frac{1}{\|\vec{x}_2-\vec{x}_1\|^2}\Big) \\ & \ddot{\theta}(t)=-\mu\dot{\theta}(t)-\frac{g}{L}\sin\big({\theta}(t)\big) \end{split}

Partial Differential Equation

热传导方程:

Tt(x,t)=α2Tx2(x,t)\frac{\partial{T}}{\partial{t}}(x,t)=\alpha\frac{\partial^2{T}}{\partial{x^2}}(x,t)

Black-Scholes / Merton equation:

Vt+rSVS+12σ2S22VS2rV=0\frac{\partial{V}}{\partial{t}}+rS\frac{\partial{V}}{\partial{S}}+\frac{1}{2}\sigma^2S^2\frac{\partial^2{V}}{\partial{S^2}}-rV=0

Phase Space

相空间是描述系统状态的空间, 每个点代表系统的一个状态, 点的轨迹描述了系统的演化.

import numpy as np

# Physical constants
g = 9.8
L = 2
mu = 0.1

THETA_0 = np.pi / 3 # 60 degrees
THETA_DOT_0 = 0 # No initial angular velocity

# Definition of ODE
def get_theta_double_dot(theta, theta_dot):
return -mu * theta_dot - (g / L) * np.sin(theta)

# Solution to the differential equation (numerically)
def theta(t):
theta = THETA_0
theta_dot = THETA_DOT_0
delta_t = 0.01 # Time step
for _ in np.arange(0, t, delta_t):
theta_double_dot = get_theta_double_dot(theta, theta_dot)
theta += theta_dot * delta_t
theta_dot += theta_double_dot * delta_t
return theta

Mathematical Statistics

Normal Distribution

若随机变量 XX 服从一个位置参数为 μ\mu, 尺度参数为 σ\sigma 的概率分布, 且其概率密度函数 (Probability Density Function, PDF) 为:

f(x)=1σ2πe12(xμσ)2\begin{equation} f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} \end{equation}

则这个随机变量称为正态随机变量, 正态随机变量服从的分布称为正态分布, 记作 XN(μ,σ2)X \sim N(\mu,\sigma^2), 读作 XX 服从 N(μ,σ2)N(\mu,\sigma^2) (正态分布). 其中 μ\mu 为均值 (数学期望 Mean), σ\sigma 为标准差 (Standard Deviation).

正态分布 (又称 Gaussian Distribution) 是一种连续概率分布. 当 μ\mu 为 0, σ\sigma 为 1 时, 称为标准正态分布 (Standard Normal Distribution).

Central Limit Theorem

在自然界与生产中, 一些现象受到许多相互独立的随机因素的影响, 如果每个因素所产生的影响都很微小时, 总影响 (Sum) 可以看作服从正态分布.

相互独立的正态分布, 其和也是正态分布. 总体正态分布的均值等于各个分布的均值之和, E(X1++Xn)=E(X1)++E(Xn)=nμE(X_1+\dots+X_n)=E(X_1)+\dots+E(X_n)=n\mu. 假设协方差为 0, 则总体正态分布的方差等于各个分布的方差之和, Var(X1++Xn)=Var(X1)++Var(Xn)=nσ2{Var(X_1+\dots+X_n)}={Var(X_1)+\dots+Var(X_n)}={n}\sigma^2, 可以得到总体正态分布的标准差为 nσ\sqrt{n}\sigma.

设随机变量 X1,X2,,XnX_1,X_2,\dots,X_n 独立同分布(Independent Identically Distribution), 且均值为 E(Xi)=μE(X_i)=\mu, 方差为 D(Xi)=σ2D(X_i)=\sigma^2, 对于任意 xx, 其分布函数为

Fn(x)=P{i=1nXinμnσx}F_n(x)=P\left\{\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma}\leq{x}\right\}

满足

limnFn(x)=limnP{i=1nXinμnσx}=12πxet22dt=(x)\begin{equation} \lim_{n\to\infty}F_n(x) =\lim_{n\to\infty}P\left\{\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma}\leq{x}\right\} =\frac{1}{\sqrt{2\pi}}\int_{-\infty}^x{e^{-\frac{t^2}{2}}dt} =\varnothing(x) \end{equation}

独立同分布的中心极限定理说明, 当 nn 足够大时, 随机变量 Xn=i=1nXiX_n=\sum\limits_{i=1}^n{X_i} 近似服从正态分布 N(nμ,nσ2)N(n\mu,n\sigma^2); 标准化后的随机变量 Yn=i=1nXinμnσY_n=\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma} 近似服从标准正态分布 N(0,1)N(0,1).

Central Limit Theorem

更一般化的中心极限定理, 可参见林德伯格中心极限定理 (Lindeberg CLT) etc.

Gaussian Integral

ex2dx=π\begin{equation} \int_{-\infty}^{\infty}e^{-x^2}dx=\sqrt{\pi} \end{equation}

高维空间求解高斯积分:

Gaussian Integral

对于正态分布, 系数 1π\frac{1}{\sqrt{\pi}} 使得概率密度函数的积分为 1, 即 f(x)dx=1\int_{-\infty}^{\infty}f(x)dx=1, 使其成为有意义的概率分布.

Binomial Distribution

重复 n 次独立的伯努利试验, XB(n,p)X \sim B(n,p), 期望值 E(X)=npE(X)=np, 方差 D(X)=np(1p)D(X)=np(1-p):

P(X=k)=Cnkpk(1p)nk\begin{equation} P(X=k)=C_n^kp^k(1-p)^{n-k} \end{equation}

Bayes Theorem

Bayes theorem:

P(AB)=P(AB)P(B)=P(BA)P(A)P(A \cap B)=P(A|B)P(B)=P(B|A)P(A)\Rightarrow P(AB)=P(BA)P(A)P(B)=P(BA)P(A)P(BA)P(A)+P(B¬A)P(¬A)\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)}=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\neg{A})P(\neg{A})} \end{equation}

Bayes Theorem

其中, P(BA)P(B¬A)\frac{P(B|A)}{P(B|\neg{A})} 称为贝叶斯系数 (Bayes Factor):

O(AB)=P(AB)P(¬AB)=P(AB)P(B)P(¬AB)P(B)=P(BA)P(A)P(B¬A)P(¬A)=O(A)P(BA)P(B¬A)O(A|B)=\frac{P(A|B)}{P(\neg{A}|B)}=\frac{P(A|B)P(B)}{P(\neg{A}|B)P(B)}=\frac{P(B|A)P(A)}{P(B|\neg{A})P(\neg{A})}=O(A)\frac{P(B|A)}{P(B|\neg{A})}

Information Entropy

信息熵 是对信息量的度量 (E[I]E[I]), 概率小的事件发生所带来的信息量大, 概率大的事件发生所带来的信息量小, 即概率小, 出现机会小, 不确定性大, 信息熵大, 信息量大:

H(X)=E[log2P(xi)]=i=1nP(xi)log2P(xi)\begin{equation} H(X)=E[-\log_2{P(x_i)}]=-\sum\limits_{i=1}^n{P(x_i)\log_2{P(x_i)}} \end{equation}

Supervised Learning

Regression

Output a scalar:

  • Linear regression: y=Wx+b=i=1nwixi+by=Wx+b=\sum\limits_{i=1}^n{w_ix_i}+b, L=i=1n(yiy^i)2L=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2.
  • Polynomial regression: y=i=1nwixi+by=\sum\limits_{i=1}^n{w_ix^i}+b.
  • Logistic regression (output probability): y=σ(Wx+b)=11+ei=1nwixiby=\sigma(Wx+b)=\frac{1}{1+e^{-\sum\limits_{i=1}^n{w_ix_i}-b}}, L=i=1nyilog(y^i)L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}.

If model can't even fit training data, then model have large bias (underfitting). If model can fit training data but not testing data, then model have large variance (overfitting).

Underfitting

To prevent underfitting, we can:

  • Add more features as input.
  • Use more complex model.
Overfitting

More complex model does not always lead to better performance on testing data or new data.

ModelTraining ErrorTesting Error
xx31.935.0
x2x^215.418.4
x3x^315.318.1
x4x^414.928.2
x5x^512.8232.1

Regularization can help to prevent overfitting:

L(w)=i=1n(yiy^i)2+λi=1nwi2wt+1=wtηL(w)=wtη(Lw+λwt)=(1ηλ)wtηLw(Weight Decay)\begin{split} L(w)&=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum\limits_{i=1}^n{w_i^2}\\ w_{t+1}&=w_t-\eta\nabla{L(w)}\\ &=w_t-\eta(\frac{\partial{L}}{\partial{w}}+\lambda{w_t})\\ &=(1-\eta\lambda)w_t-\eta\frac{\partial{L}}{\partial{w}} \quad (\text{Weight Decay}) \end{split}

Classification

  • Binary classification: y=δ(Wx+b)y=\delta(Wx+b), L=i=1nδ(yiy^i)L=\sum\limits_{i=1}^n\delta(y_i\ne\hat{y}_i), e.g spam filtering.
  • Multi-class classification: y=softmax(Wx+b)y=\text{softmax}(Wx+b), L=i=1nyilog(y^i)L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}, e.g document classification.
  • Non-linear model:
    • Deep learning: y=softmax(ReLU(Wx+b))y=\text{softmax}(\text{ReLU}(Wx+b)), e.g image recognition, game playing.
    • Support vector machine (SVM): y=sign(Wx+b)y=\text{sign}(Wx+b).
    • Decision tree: y=vote(leaves(x))y=\text{vote}(\text{leaves}(x)).
    • K-nearest neighbors (KNN): y=vote(neighbors(x))y=\text{vote}(\text{neighbors}(x)).

Structured Learning

Training

Find a function FF:

F:X×YRF:X\times{Y}\to{R}

F(x,y)F(x, y) evaluates how well yy fits xx (object compatible).

Inference

Given an object xx:

y~=argmaxyYF(x,y)\tilde{y}=\arg\max\limits_{y\in{Y}}F(x, y)

Structured Learning

Three Problems
  • Evaluation: what does F(X,y)F(X, y) look like.
  • Inference: how to solve argmax\arg\max problem.
  • Training: how to find F(x,y)F(x, y) with given training data.

Structured Linear Model

F(x,y)=i=1nwiϕi(x,y)=[w1w2w3wn][ϕ1(x,y)ϕ2(x,y)ϕ3(x,y)ϕn(x,y)]=WΦ(x,y)\begin{split} F(x, y)&=\sum\limits_{i=1}^n{w_i\phi_i(x, y)} \\ &=\begin{bmatrix}w_1\\w_2\\w_3\\\vdots\\w_n\end{bmatrix}\cdot \begin{bmatrix}\phi_1(x, y)\\\phi_2(x, y)\\\phi_3(x, y)\\\vdots\\\phi_n(x, y)\end{bmatrix}\\ &=W\cdot\Phi(x, y) \end{split}

Unsupervised Learning

Principal Component Analysis

主成分分析 (PCA) 是一种常用的数据降维方法, 将 mmnn 维向量降为 kk 维, 其目标是选择 kk 个单位 (模为 11) 正交基, 使得原始数据变换到这组基上后, 各字段两两间协方差为 00 (各字段完全独立), 各字段的方差尽可能大 (各字段降维后分布尽可能分散), 即在正交的约束下, 取最大的 kk 个方差:

C=1mXXT=1m[i=1m(xi1)2i=1mxi1xi2i=1mxi1xini=1mxi2xi1i=1m(xi2)2i=1mxi2xini=1mxinxi1i=1mxinxi2i=1m(xin)2]\begin{equation} C=\frac{1}{m}XX^T=\frac{1}{m}\begin{bmatrix} \sum_{i=1}^m(x_i^1)^2&\sum_{i=1}^m{x_i^1x_i^2}&\dots&\sum_{i=1}^m{x_i^1x_i^n}\\ \sum_{i=1}^m{x_i^2x_i^1}&\sum_{i=1}^m(x_i^2)^2&\dots&\sum_{i=1}^m{x_i^2x_i^n}\\ \vdots&\vdots&\ddots&\vdots\\ \sum_{i=1}^m{x_i^nx_i^1}&\sum_{i=1}^m{x_i^nx_i^2}&\dots&\sum_{i=1}^m(x_i^n)^2 \end{bmatrix} \end{equation}

协方差矩阵 CC 是一个对称矩阵, 其对角线分别为各字段的方差, 其第 iijj 列和第 jjii 列元素相同, 表示 iijj 两个字段的协方差. 将协方差矩阵对角化, 得到基于矩阵运算的 PCA 算法如下:

  • 将原始数据按列组成 nnmm 列矩阵 XX.
  • XX 的每一行 (代表一个属性字段) 进行零均值化, 即减去这一行的均值, 使得 xˉ=0\bar{x}=0, 方便方差与协方差的计算.
  • 求出协方差矩阵 C=1mXXTC=\frac{1}{m}XX^T 的特征值及对应的特征向量.
  • 将特征向量按对应特征值大小从上到下按行排列成矩阵, 取前 kk 行组成矩阵 PP.
  • Y=PXY=PX 即为降维到 kk 维后的数据.

Word Embedding

词嵌入是自然语言处理 (NLP) 中的一种技术, 将词汇映射到实数向量空间, 使得词汇之间的语义关系可以通过向量空间中的距离来表示.

Variational Auto-Encoders

变分自编码器 (VAEs) 是一种生成模型, 通过学习数据的潜在分布来生成新的数据: Z=Encoder(X),X=Decoder(Z),Min Loss(X,X)Z=\text{Encoder}(X), X'=\text{Decoder}(Z), \text{Min Loss}(X',X). 变分自动编码器学习的是隐变量 (特征) ZZ 的概率分布, zN(0,I),xzN(μ(z),σ(z))z\sim N(0, I), x|z\sim N\big(\mu(z), \sigma(z)\big), 通过深度网络来学习 q(zx)q(z|x) 的参数, 一步步优化 qq 使其与 p(zx)p(z|x) 十分相似, 便可用它来对复杂的分布进行近似的推理.

Variational Auto-Encoders

Generative Adversarial Networks

生成对抗网络 (GANs) 由两个网络组成: 生成器 (Generator) 和判别器 (Discriminator). 生成器的目标是生成尽可能逼真的数据, 判别器的目标是尽可能准确地区分真实数据和生成数据. 两个网络相互对抗, 生成器生成数据 (decoder in VAE), 判别器判断数据真伪 (1/01/0 classification neural network), 生成器根据判别器的判断结果调整生成数据的策略, 不断提升生成数据的逼真程度.

Self-supervised Learning

Pre-trained model + fine-tuning.

Reinforcement Learning

强化学习是一种机器学习方法, 通过智能体与环境交互, 智能体根据环境的反馈调整策略, 利用梯度上升算法 (Gradient Ascent), 最大化长期奖励 (learn from rewards and mistakes).

θ=argmaxθRˉθ=argmaxθτR(τ)P(τθ)θt+1=θt+ηRˉθRˉθ=[Rˉθw1Rˉθw2Rˉθb1]\begin{equation} \begin{split} \theta^*&=\arg\max\limits_\theta\bar{R}_\theta=\arg\max\limits_\theta\sum\limits_{\tau}R(\tau)P(\tau|\theta)\\ \theta_{t+1}&=\theta_t+\eta\nabla\bar{R}_\theta\\ \nabla\bar{R}_\theta&=\begin{bmatrix}\frac{\partial\bar{R}_\theta}{\partial{w_1}}\\\frac{\partial\bar{R}_\theta}{\partial{w_2}}\\\vdots\\\frac{\partial\bar{R}_\theta}{\partial{b_1}}\\\vdots\end{bmatrix} \end{split} \end{equation}

Reinforcement Learning

Multilayer Perceptron

Multilayer Perceptron

多层感知机是一种前馈神经网络 (Feedforward Neural Network) 就像是一个模拟大脑处理信息的过程, 通过多层处理 (输入层, 隐藏层, 输出层), 从原始数据中提取特征, 并做出预测或分类, 它通过调整内部连接权重来学习和改进其预测能力.

Linear Mapping

线性变换 Hl=WlXl1+BlH^l=W^lX^{l-1}+B^l:

  • wijlw_{ij}^l (weight): 第 ll 层第 ii 个节点与上一层第 jj 个节点连接的权重.
  • bilb_i^l (bias): 第 ll 层第 ii 个节点的偏置.
H=[w00lw01lw0nlw10lw11lw1nlwk0lwk1lwknl][x0l1x1l1xnl1]+[b0lb1lbkl]H=\begin{bmatrix} w_{00}^l & w_{01}^l & \dots & w_{0n}^l \\ w_{10}^l & w_{11}^l & \dots & w_{1n}^l \\ \vdots & \vdots & \ddots & \vdots \\ w_{k0}^l & w_{k1}^l & \dots & w_{kn}^l \end{bmatrix} \begin{bmatrix} x_0^{l-1} \\ x_1^{l-1} \\ \vdots \\ x_n^{l-1} \end{bmatrix} +\begin{bmatrix} b_0^l \\ b_1^l \\ \vdots \\ b_k^l \end{bmatrix}

Activation Function

激活函数 y=σ(H)y=\sigma(H), Xl=σ(Hl)X^l=\sigma(H^l):

  • 引入非线性特性, 使得网络可以学习和模拟复杂函数.
  • ReLU (Rectified Linear Unit, 线性整流单元): σ(H)=max(0,H)\sigma(H)=\max(0,H), 可以解决梯度消失问题 (越靠近输入层的神经元梯度越接近 0), 加速收敛.
  • Sigmoid: σ(H)=11+eH\sigma(H)=\frac{1}{1+e^{-H}}.
  • e.g 归一化函数, 使得输出值在 0 到 1 之间, 可以使得整个网络成为概率模型.

Loss Function

损失函数 L(y,y^)L(y,\hat{y}):

  • 用于衡量真实值(或人工标注值) yy 与模型预测值 y^\hat{y} 之间的差异.
  • 常见的损失函数有均方误差 (Mean Squared Error, MSE) 和交叉熵 (Cross Entropy).
    • 均方误差: L(y,y^)=1ni=1n(yiy^i)2L(y,\hat{y})=\frac{1}{n}\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2.
    • 交叉熵: L(y,y^)=i=1nyilog(y^i)L(y,\hat{y})=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}.
Learning

Learning is the process of minimizing loss function, finally find out the right weights and biases:

  • Early stopping.
  • Regularization.
  • Dropout.
  • New activation function.
  • Adaptive learning rate.

Gradient Descent

通过梯度下降算法, 优化损失函数, 使其最小化 (沿梯度下降方向, 调整 W 和 B):

θt+1=θtηL\begin{equation} \theta_{t+1}=\theta_t-\eta\nabla{L} \end{equation}

其中, η\eta 为学习率, LL 为损失函数, L\nabla{L} 为损失函数的梯度, θ\theta 为模型参数, tt 为迭代次数.

argminL(θ)=L(a,b)+L(a,b)θ1(θ1a)+L(a,b)θ2(θ2b)++1n!nL(a,b)θ1n(θ1a)n+1n!nL(a,b)θ2n(θ2b)nL(a,b)+L(a,b)θ1(θ1a)+L(a,b)θ2(θ2b)[θ1aθ2b]=η[L(a,b)θ1L(a,b)θ2][θ1θ2]=[ab]η[L(a,b)θ1L(a,b)θ2]\begin{split} \arg\min{L(\theta)}&=L(a,b)+\frac{\partial{L(a,b)}}{\partial{\theta_1}}(\theta_1-a)+\frac{\partial{L(a,b)}}{\partial{\theta_2}}(\theta_2-b)\\ &+\dots+\frac{1}{n!}\frac{\partial^n{L(a,b)}}{\partial{\theta_1^n}}(\theta_1-a)^n+\frac{1}{n!}\frac{\partial^n{L(a,b)}}{\partial{\theta_2^n}}(\theta_2-b)^n\\ &\approx{L(a,b)}+\frac{\partial{L(a,b)}}{\partial{\theta_1}}(\theta_1-a)+\frac{\partial{L(a,b)}}{\partial{\theta_2}}(\theta_2-b) \\ &\Rightarrow \begin{bmatrix}\theta_1-a\\\theta_2-b\end{bmatrix} =-\eta\begin{bmatrix}\frac{\partial{L(a,b)}}{\partial{\theta_1}}\\\frac{\partial{L(a,b)}}{\partial{\theta_2}}\end{bmatrix} \\ &\Rightarrow \begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix} =\begin{bmatrix}a\\b\end{bmatrix}-\eta\begin{bmatrix}\frac{\partial{L(a,b)}}{\partial{\theta_1}}\\\frac{\partial{L(a,b)}}{\partial{\theta_2}}\end{bmatrix} \end{split}
Gradient

一个优秀的梯度下降算法, 需要满足以下几个条件:

  • 高效性: 梯度下降算法的迭代次数尽可能少. 当距离最小值较远时, L0\nabla{L}\gg0, 当距离最小值较近时, L0\nabla{L}\to0, 这样的梯度下降算法可以更快地收敛. 反之, 当距离最小值较远时, L0\nabla{L}\to0, 这样的梯度下降算法更慢收敛.
  • 稳定性: 梯度下降算法的迭代过程尽可能稳定. Maxout 激活函数拟合能力非常强, 可以拟合任意的凸函数.
  • 鲁棒性: 梯度下降算法对于初始值的选择或者特定的线索片段不敏感. Dropout 策略减少神经元之间复杂的共适应关系 (每个神经元有 p%p\% 概率不被激活), 迫使网络去学习更加鲁棒的特征, 缓解过拟合问题, 提高模型的泛化能力.

Gradient Descent

def convex_function(x):
return x**2

def gradient_descent(initial_x, learning_rate, num_iterations):
x_values = [initial_x]
y_values = [convex_function(initial_x)]

x = initial_x

for i in range(num_iterations):
gradient = 2 * x # 函数 f(x) = x^2 的导数为 f'(x) = 2x
x -= learning_rate * gradient

x_values.append(x)
y_values.append(convex_function(x))

return x_values, y_values
Learning Rate

必要时需要调整学习率, 使得梯度下降更快收敛:

  • 如果学习率过大, 可能会导致梯度下降不稳定, 甚至发散.
  • 如果学习率过小, 可能会导致梯度下降收敛速度过慢.
  • 常见的学习率调整策略有:
    • 阶梯衰减 (Step Decay): ηt=ηt+1\eta_t=\frac{\eta}{\sqrt{t+1}}.
    • 线性衰减 (Linear Decay): ηt=η(1tT)\eta_t=\eta(1-\frac{t}{T}).
    • 指数衰减 (Exponential Decay): ηt=ηekt\eta_t=\eta{e^{-kt}}.
    • 余弦衰减 (Cosine Decay): ηt=η1+cos(πtT)2\eta_t=\eta\frac{1+\cos(\frac{\pi{t}}{T})}{2}.
    • AdaGrad: adaptive sub-gradient method, wt+1=wtηt+11t+1i=0tgi2gt=wtηi=0tgi2gtw_{t+1}=w_t-\frac{\frac{\eta}{\sqrt{t+1}}}{\sqrt{\frac{1}{t+1}\sum_{i=0}^t{g_i^2}}}g_t=w_t-\frac{\eta}{\sqrt{\sum_{i=0}^t{g_i^2}}}g_t.
    • RMSprop: root mean square propagation, wt+1=wtησtgt=wtηασt12+(1α)gt2gtw_{t+1}=w_t-\frac{\eta}{\sigma_t}g_t=w_t-\frac{\eta}{\sqrt{\alpha\sigma_{t-1}^2}+(1-\alpha)g_t^2}g_t,
    • Momentum: wt+1=wt+vt+1=wt+λvtηgtw_{t+1}=w_t+v_{t+1}=w_t+\lambda{v_t}-\eta{g_t}.
    • Adam: adaptive moment estimation (RMSprop + Momentum).

Backpropagation

反向传播算法: 从最小化损失函数出发, 由输出层到输入层, 通过链式法则, 计算每一层的梯度, 从而更新权重和偏置.

Backpropagation

Derivative chain rule (链式法则):

Lwijl=Lxilxilzilzilwijl=Lxilσ(zil)zil(Xl1Wil+bil)wijl=δilσ(zil)xjl1Lbil=Lxilxilzilzilbil=Lxilσ(zil)zil(bil+WilXl1)bil=δilσ(zil)Lxjl1=i=0Nl1Lxilxilzilzilxjl1=i=0Nl1Lxilσ(zil)zil(WilXl1+bil)xjl1=i=0Nl1δilσ(zil)wijl\begin{split} \frac{\partial{L}}{\partial{w_{ij}^l}} &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot \frac{\partial{z_i^l}}{\partial{w_{ij}^l}} \\ &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot \frac{\partial(X^{l-1}W_i^l+b_i^l)}{\partial{w_{ij}^l}} \\ &=\delta_i^l\cdot\sigma'(z_i^l)\cdot{x_j^{l-1}} \\ \frac{\partial{L}}{\partial{b_i^l}} &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot \frac{\partial{z_i^l}}{\partial{b_i^l}} \\ &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot \frac{\partial(b_i^l+W_i^lX^{l-1})}{\partial{b_i^l}} \\ &=\delta_i^l\cdot\sigma'(z_i^l) \\ \frac{\partial{L}}{\partial{x_j^{l-1}}} &=\sum\limits_{i=0}^{N_l-1}\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot \frac{\partial{z_i^l}}{\partial{x_j^{l-1}}} \\ &=\sum\limits_{i=0}^{N_l-1}\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot \frac{\partial(W_i^lX^{l-1}+b_i^l)}{\partial{x_j^{l-1}}} \\ &=\sum\limits_{i=0}^{N_l-1}\delta_i^l\cdot\sigma'(z_i^l)\cdot{w_{ij}^l} \end{split}

outcomes

L=[Lw1Lb1LwlLbl]\begin{equation} \nabla{L}=\begin{bmatrix} \frac{\partial{L}}{\partial{w^1}} \\[0.8em] \frac{\partial{L}}{\partial{b^1}} \\[0.5em] \vdots \\[0.5em] \frac{\partial{L}}{\partial{w^l}} \\[0.8em] \frac{\partial{L}}{\partial{b^l}} \end{bmatrix} \end{equation}
Mini-Batch

Utilize parallel computing (GPU) to speed up training process:

  • Divide training data into mini-batches.
  • Update weights and biases for each mini-batch.
  • Repeat until convergence.
class Network(object):
def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
"""Train the neural network using mini-batch stochastic
gradient descent. The ``training_data`` is a list of tuples
``(x, y)`` representing the training inputs and the desired
outputs. The other non-optional parameters are
self-explanatory. If ``test_data`` is provided then the
network will be evaluated against the test data after each
epoch, and partial progress printed out. This is useful for
tracking progress, but slows things down substantially."""
if test_data:
n_test = len(test_data)
n = len(training_data)
for j in range(epochs):
random.shuffle(training_data)
mini_batches = [
training_data[k : k + mini_batch_size]
for k in range(0, n, mini_batch_size)
]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
if test_data:
print(
"Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test)
)
else:
print("Epoch {0} complete".format(j))

def update_mini_batch(self, mini_batch, eta):
"""Update the network's weights and biases by applying
gradient descent using backpropagation to a single mini batch.
The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
is the learning rate."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
delta_nabla_b, delta_nabla_w = self.backpropagation(x, y)
nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
self.weights = [
w - (eta / len(mini_batch)) * nw for w, nw in zip(self.weights, nabla_w)
]
self.biases = [
b - (eta / len(mini_batch)) * nb for b, nb in zip(self.biases, nabla_b)
]

def backpropagation(self, x, y):
"""Return a tuple ``(nabla_b, nabla_w)`` representing the
gradient for the cost function C_x. ``nabla_b`` and
``nabla_w`` are layer-by-layer lists of numpy arrays, similar
to ``self.biases`` and ``self.weights``."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
# feedforward
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation) + b
zs.append(z)
activation = self.non_linearity(z)
activations.append(activation)
# backward pass
delta = self.cost_derivative(activations[-1], y) * self.d_non_linearity(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
# Note that the variable l in the loop below is used a little
# differently to the notation in Chapter 2 of the book. Here,
# l = 1 means the last layer of neurons, l = 2 is the
# second-last