Hung-yi Lee Machine Learning (Spring, Mandarin) Notes
Lesson 1: Basic Concepts of ML
Different types of functions
Regression: The function outputs a scalar
Classification: Given options(classes), the function outputs the correct one.
How to find the Function
Three steps
Function with Unknown Parameters
Define Loss from Training Data
- Loss is a function of the parameters: $L(b, w)$
- Loss: how good a set of values is.
Loss: $L = \frac{1}{N}\sum_{n} e_{n}$
Types of Loss
- MAE: L is the mean absolute error (MAE), $e = |y-\hat{y}|$
- MSE: L is the mean square error (MSE), $e = (y-\hat{y})^{2}$ (both sketched after this list)
- If y and $\hat{y}$ are both probability distributions —> cross-entropy
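A minimal sketch (assuming NumPy arrays `y_pred` for the model outputs and `y_true` for the labels) of how MAE and MSE average the per-example error $e_n$:

```python
import numpy as np

def mae(y_pred, y_true):
    # e_n = |y - y_hat|, averaged over the N training examples
    return np.mean(np.abs(y_pred - y_true))

def mse(y_pred, y_true):
    # e_n = (y - y_hat)^2, averaged over the N training examples
    return np.mean((y_pred - y_true) ** 2)
```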
Optimization
The way to find the best w and b: gradient descent
Randomly pick an initial value $w^0$
compute $\frac{\partial{L}}{\partial{w}}|_{w=w^0}$
- Negative —> increase w
- Positive —> decrease w
- Learning rate: a hyperparameter —> can be set by humans
update w iteratively
Disadvantage of gradient descent: it may fail to find the globally best w* —> it can get stuck in a local minimum
The same steps apply to finding b*.
The above procedure is called training.
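A minimal sketch of this training loop, assuming the simple model y = b + w·x from the lecture, an MSE loss, and an illustrative learning rate `eta` (not a value from the slides):

```python
import numpy as np

def train(x, y, eta=1e-4, steps=1000):
    w, b = 0.0, 0.0                     # (randomly) picked initial values w0, b0
    for _ in range(steps):
        err = (b + w * x) - y           # prediction error of y = b + w*x
        grad_w = np.mean(2 * err * x)   # dL/dw for the MSE loss
        grad_b = np.mean(2 * err)       # dL/db for the MSE loss
        w -= eta * grad_w               # negative gradient -> increase w, positive -> decrease w
        b -= eta * grad_b
    return w, b
```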
Gradually consider more past days as features to compute the best w* (different models).
Lesson 2: Basic Concepts of ML (Part 2)
Model Bias
Approximate a continuous curve by a piecewise linear curve.
To get a good approximation, we need sufficiently many pieces.
Use multiple such pieces to fit the red curve.
New Model: more Features
Use $\theta$ to represent all of the unknown parameters
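Restating the model from the lecture as a sketch ($\theta$ collects $b$, the $c_i$, $b_i$, and $w_{ij}$): a flexible function is built by summing sigmoid pieces over the features $x_j$:

$$y = b + \sum_{i} c_i \,\operatorname{sigmoid}\!\Bigl(b_i + \sum_{j} w_{ij}\, x_j\Bigr)$$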
Loss
Loss is a function of the parameters: $L(\theta)$
Loss means how good a set of values is.
Optimization of New Model
Activation function
One hard sigmoid can be composed from two ReLUs, so ReLU can replace sigmoid as the activation function.
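A minimal sketch of this composition (the constants are illustrative, not from the slides): two shifted ReLUs subtracted from each other give a flat-ramp-flat piece, i.e., a hard sigmoid:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hard_sigmoid_piece(x, c=1.0, w=1.0, b=0.0):
    # relu(z) - relu(z - 1) is 0 for z < 0, z for 0 <= z <= 1, and 1 for z > 1:
    # a piecewise-linear "hard sigmoid" built from two ReLUs
    z = w * x + b
    return c * (relu(z) - relu(z - 1.0))
```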
More complex models: stack more layers (go deep)
Why don’t we go deeper? —— Overfitting
Lesson 3: General Guidance (Strategy) for ML Tasks
How to improve performance
A 56-layer network can do everything a 20-layer network can, so if the deeper network has worse training loss, the problem is optimization, not overfitting.
The way to solve overfitting
Use a more constrained model to mitigate overfitting.
Bias-Complexity trade-off
A good way to improve performance:
Split Training Set into Training Set and Validation Set
Mismatch: the training data and testing data come from different distributions.
Lesson 4: Local Minima and Saddle Points
Critical point
Use a Taylor series approximation to locally represent the loss function around a point $\theta'$.
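The second-order expansion around $\theta'$, with gradient $g$ and Hessian $H$ (as in the lecture):

$$L(\theta) \approx L(\theta') + (\theta-\theta')^{T} g + \tfrac{1}{2}(\theta-\theta')^{T} H (\theta-\theta')$$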
The term $(\theta-\theta')^{T} g$ vanishes when the loss is at a critical point, since the gradient $g$ is zero there.
Eigenvalues of the Hessian $H$: all positive —> local minimum; all negative —> local maximum; some positive and some negative —> saddle point.
Example
Higher Dimension
A saddle point in a higher-dimensional space still has directions along which the loss can decrease, so in higher dimensions critical points are often escapable saddle points rather than local minima.
Lesson 5: Batch and Momentum
With parallel computation (GPU), a large batch needs less time per epoch, while a small batch needs more time per epoch (more updates).
What’s wrong with large batch size? —> Optimization Fails
Small batch: when one batch gets stuck at a critical point, another batch's gradient can help overcome it.
Small batch is better on test data.
Large batch can be worse on test data even with similar training accuracy —> a kind of overfitting.
There are many differences between small batch and large batch.
Momentum
Momentum: the update direction combines the previous movement with the negative gradient, rather than following the negative gradient alone.
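A minimal sketch of the momentum update, assuming a hypothetical gradient function `grad(theta)` and illustrative hyperparameters:

```python
def momentum_step(theta, m, grad, eta=0.01, lam=0.9):
    # m: previous movement; lam: how much of it is kept (a hyperparameter)
    m_new = lam * m - eta * grad(theta)   # previous movement plus the negative gradient step
    theta_new = theta + m_new             # move along the combined direction
    return theta_new, m_new
```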
Lesson 6: Learning Rate (Error Surface Is Rugged)
Tips for training: Adaptive Learning Rate.
Training stuck != Small Gradient
In the convex error-surface example from the slides, a learning rate that is too big makes the parameters oscillate between the two walls of the valley, while one that is too small barely makes progress along the flat direction.
Learning rate adapts dynamically
RMSProp is an unpublished optimization algorithm designed for neural networks, first proposed by Geoff Hinton in Lecture 6 of his online course.
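A minimal RMSProp-style sketch: every parameter gets its own effective learning rate through a running average of squared gradients (`grad`, `alpha`, and `eps` are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def rmsprop_step(theta, sq_avg, grad, eta=1e-3, alpha=0.9, eps=1e-8):
    g = grad(theta)
    sq_avg = alpha * sq_avg + (1 - alpha) * g ** 2       # exponential moving average of g^2
    theta = theta - eta * g / (np.sqrt(sq_avg) + eps)    # per-parameter adaptive step size
    return theta, sq_avg
```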
Learning Rate Scheduling
- Decay
- Warm up
Lesson 7: Influence of the Loss Function
Mean Square Error (MSE) vs. Cross-entropy
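A minimal sketch comparing the two losses on a classification output passed through a softmax (labels assumed one-hot); the lecture's point is that cross-entropy gives a friendlier error surface for classification:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def mse_loss(logits, one_hot):
    return np.mean((softmax(logits) - one_hot) ** 2)

def cross_entropy_loss(logits, one_hot):
    return -np.sum(one_hot * np.log(softmax(logits) + 1e-12))
```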
Lesson 8: CNN (Image Classification)
Number of weights == input × kernel × output.
Identifying some critical patterns is enough to recognize an object; a neuron does not need to see the whole image.
Receptive field design
- Different neurons can have different sizes of receptive field.
- A receptive field can cover only some of the channels.
- A receptive field does not have to be square.
Typical setting of receptive fields
- Kernel size is usually 3×3.
- One receptive field is typically covered by a set of neurons (e.g., 64 neurons).
Stride
- Receptive fields overlap.
- Padding: used when a receptive field goes beyond the edge of the image.
The same patterns appear in different regions.
Does every region need its own pattern detector?
Solution to the above issue ==> parameter sharing
Neurons for different regions share the same weights.
Two neurons with the same receptive field use different weights, i.e., they do not share parameters.
Benefit of Convolutional Layer
Another good story to describe Convolutional Layer
Feature Map
A 1-channel image passed through 64 filters —> an "image" (feature map) with 64 channels; the next layer's filters are then 3×3×64, and 64 of them again give an "image" with 64 channels —> …
With deeper convolutional layers, a small 3×3 filter effectively sees a larger region of the original image.
Pooling
Max pooling: select the maximum value in each pooling region.
Pooling makes the image smaller, which reduces computation.
Pooling can sometimes hurt the result of image recognition.
The Whole CNN
Flatten: reshape the final feature map into a vector.
Then pass it through fully connected layers and a softmax.
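A minimal sketch of this whole pipeline in PyTorch (layer sizes are illustrative and assume a 3-channel 32×32 input, not numbers from the slides):

```python
import torch.nn as nn

# convolution -> pooling (repeated) -> flatten -> fully connected -> softmax
cnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),    # 64 filters of size 3x3
    nn.ReLU(),
    nn.MaxPool2d(2),                                # max pooling halves the spatial size
    nn.Conv2d(64, 64, kernel_size=3, padding=1),    # the next layer's filters are 3x3x64
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                   # feature map -> vector
    nn.Linear(64 * 8 * 8, 10),                      # 8x8 spatial size left after two poolings of 32x32
    nn.Softmax(dim=-1),
)
```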
Why CNN for Go Playing?
- Some patterns are much smaller than the whole image (board).
- The same patterns appear in different regions.
Alpha Go does not use Pooling…
CNNs are not invariant to scaling and rotation, so image-recognition work needs data augmentation.
Lesson 9: Self-Attention (Part 1)
Example input from the lecture: the tongue-twister "货拉拉拉不拉拉布拉多犬呢" ("Will Huolala haul a Labrador or not?"); each word is a vector, so the sentence is a set of vectors.
When the input is a set of vectors ( not one vector )
What is the output?
N to N (each input vector has a corresponding output)
N to one (the whole sequence has a single output)
N to M, where the model decides the output length —> seq2seq
seq2seq:
- translation
- speech recognition
Sequence Labeling: N2N
Self-attention
Attention is all you need.
The input vectors first go into a self-attention layer, which outputs a sequence of vectors that each take the whole sequence into account; these vectors are then fed into fully connected layers.
FC layers and self-attention layers can be used alternately (stacked in turn).
The input of a self-attention layer can be the original input or the output of a hidden layer.
How to calculate the attention scores α (e.g., by the dot product of query and key).
The output b will be closer to the value vector v whose attention score is the largest.
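A minimal NumPy sketch of dot-product self-attention, assuming an input matrix `X` of shape (N, d) and hypothetical weight matrices `Wq`, `Wk`, `Wv`:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # attention scores (alpha before softmax)
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)   # softmax over each row
    return alpha @ V                             # each output b is a weighted sum of the v's
```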
Lesson 10: Self-Attention (Part 2)
Multi-head Self-attention
Positional Encoding
Self-attention for Speech
Self-attention for Images
Self-attention vs. CNN
Self-attention vs. RNN
Self-attention for Graphs
Lesson 11: Transformer (Part 1)
A Transformer is a sequence-to-sequence model,
i.e., seq2seq.
That is, it takes an input sequence and outputs another sequence, and the length of the output sequence is decided by the model; it is not known in advance.
Speech synthesis: text-to-speech (TTS)
Lesson 12: Transformer (Part 2)
The encoder of a sequence-to-sequence model
The encoder takes in a sequence of vectors and outputs a sequence of vectors; this can be done with self-attention, RNN, or CNN.
Simplified view of the model, and how it is used in the Transformer.
Decoder: autoregressive decoder
The decoder takes its own output as the next input.
Comparison of the encoder and the decoder
The difference with masked self-attention:
when producing b1, it does not consider the later a2, a3, a4; the same holds when producing b2, b3, b4.
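A minimal sketch (hypothetical shapes) of how the mask is applied to the attention scores before the softmax, so that position i only attends to positions up to i:

```python
import numpy as np

def mask_scores(scores):
    # scores: (N, N) attention scores; entries above the diagonal are future positions
    N = scores.shape[0]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)   # softmax then gives future positions zero weight
```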
A special end-of-sequence token must be added so the decoder knows when to stop.
Autoregressive vs. non-autoregressive decoding
How the encoder passes information to the decoder: cross-attention
Training
Minimize the sum of the cross-entropy over all output tokens.
During training, the decoder is fed the correct (ground-truth) tokens as input, i.e., teacher forcing.
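A minimal sketch of the objective, using PyTorch's built-in cross-entropy over the decoder's output distributions (shapes are illustrative assumptions):

```python
import torch.nn.functional as F

def seq_cross_entropy(logits, target_ids):
    # logits: (T, vocab_size) decoder outputs for the T target positions (teacher-forced)
    # target_ids: (T,) ground-truth token ids, including the end-of-sequence token
    return F.cross_entropy(logits, target_ids, reduction="sum")   # total cross-entropy to minimize
```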
Tips
One such trick is to add some wrong tokens to the decoder's input during training, i.e., scheduled sampling.
GAN
We want the generated content to be as close as possible to the given (target) content.
GAN training tips
The problem with JS divergence: when the generated and real distributions do not overlap, the JS divergence is always log 2, so it gives no useful training signal.
Instead, use the Wasserstein distance, computed as the minimum cost of moving one distribution onto the other, to replace this loss (WGAN).
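For reference, the behavior that motivates the switch (a standard fact, not a formula from the slides): when $P_{data}$ and $P_G$ do not overlap, the JS divergence is stuck at a constant, whereas the Wasserstein distance still reflects how far apart the two distributions are:

$$\mathrm{JS}(P_{data} \,\|\, P_G) = \log 2 \quad \text{whenever } P_{data} \text{ and } P_G \text{ have disjoint supports}$$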