Hung-yi Lee Machine Learning (Spring, Mandarin) Notes
Lesson 1: Basic Concepts of ML
Different types of functions
Regression: The function outputs a scalar
Classification: Given options(classes), the function outputs the correct one.
How to find the Function
Three steps
Function with Unknown Parameters
Define Loss from Training Data
- Loss is a function of the parameters: $L(b, w)$
- Loss: how good a set of values is.
Loss: $L = \frac{1}{N}\sum_{n} e_{n}$
Types of Loss
- MAE: L is the mean absolute error (MAE), $e = |y-\hat{y}|$
- MSE: L is the mean square error (MSE), $e = (y-\hat{y})^{2}$ (both sketched after this list)
- If y and $\hat{y}$ are both probability distributions —> cross-entropy
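A minimal sketch (assuming NumPy arrays `y_pred` for the model outputs and `y_true` for the labels) of how MAE and MSE average the per-example error $e_n$:

```python
import numpy as np

def mae(y_pred, y_true):
    # e_n = |y - y_hat|, averaged over the N training examples
    return np.mean(np.abs(y_pred - y_true))

def mse(y_pred, y_true):
    # e_n = (y - y_hat)^2, averaged over the N training examples
    return np.mean((y_pred - y_true) ** 2)
```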
Optimization
The way to find the best w and b: gradient descent
Randomly pick an initial value $w^0$
compute $\frac{\partial{L}}{\partial{w}}|_{w=w^0}$
- Negative —> increase w
- Positive —> decrease w
- Learning rate: a hyperparameter —> can be set by humans
update w iteratively
Disadvantage of gradient descent: it may fail to find the globally best w* —> it can get stuck in a local minimum
The same steps apply to finding b*.
The above procedure is called training.
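A minimal sketch of this training loop, assuming the simple model y = b + w·x from the lecture, an MSE loss, and an illustrative learning rate `eta` (not a value from the slides):

```python
import numpy as np

def train(x, y, eta=1e-4, steps=1000):
    w, b = 0.0, 0.0                     # (randomly) picked initial values w0, b0
    for _ in range(steps):
        err = (b + w * x) - y           # prediction error of y = b + w*x
        grad_w = np.mean(2 * err * x)   # dL/dw for the MSE loss
        grad_b = np.mean(2 * err)       # dL/db for the MSE loss
        w -= eta * grad_w               # negative gradient -> increase w, positive -> decrease w
        b -= eta * grad_b
    return w, b
```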
Gradually consider more past days as features to compute the best w* (different models).
Lesson 2: Basic Concepts of ML (Part 2)
Model Bias
Approximate a continuous curve by a piecewise linear curve.
To get a good approximation, we need sufficiently many pieces.
Use multiple such pieces to fit the red curve.
New Model: more Features
Use $\theta$ to represent all of the unknown parameters
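Restating the model from the lecture as a sketch ($\theta$ collects $b$, the $c_i$, $b_i$, and $w_{ij}$): a flexible function is built by summing sigmoid pieces over the features $x_j$:

$$y = b + \sum_{i} c_i \,\operatorname{sigmoid}\!\Bigl(b_i + \sum_{j} w_{ij}\, x_j\Bigr)$$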
Loss
Loss is a function of the parameters: $L(\theta)$
Loss means how good a set of values is.
Optimization of New Model
Activation function
One hard sigmoid can be composed from two ReLUs, so ReLU can replace sigmoid as the activation function.
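A minimal sketch of this composition (the constants are illustrative, not from the slides): two shifted ReLUs subtracted from each other give a flat-ramp-flat piece, i.e., a hard sigmoid:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hard_sigmoid_piece(x, c=1.0, w=1.0, b=0.0):
    # relu(z) - relu(z - 1) is 0 for z < 0, z for 0 <= z <= 1, and 1 for z > 1:
    # a piecewise-linear "hard sigmoid" built from two ReLUs
    z = w * x + b
    return c * (relu(z) - relu(z - 1.0))
```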
More complex models: stack more layers (go deep)
Why don’t we go deeper? —— Overfitting
Lesson 3: General Guidance (Strategy) for ML Tasks
How to improve performance
A 56-layer network can do everything a 20-layer network can, so if the deeper network has worse training loss, the problem is optimization, not overfitting.
The way to solve overfitting
Use a more constrained model to mitigate overfitting.
Bias-Complexity trade-off
A good way to improve performance:
Split Training Set into Training Set and Validation Set
Mismatch: the training data and testing data come from different distributions.
Lesson 4: Local Minima and Saddle Points
Critical point
Use a Taylor series approximation to locally represent the loss function around a point $\theta'$.
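The second-order expansion around $\theta'$, with gradient $g$ and Hessian $H$ (as in the lecture):

$$L(\theta) \approx L(\theta') + (\theta-\theta')^{T} g + \tfrac{1}{2}(\theta-\theta')^{T} H (\theta-\theta')$$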
The term $(\theta-\theta')^{T} g$ vanishes when the loss is at a critical point, since the gradient $g$ is zero there.
Eigenvalues of the Hessian $H$: all positive —> local minimum; all negative —> local maximum; some positive and some negative —> saddle point.
Example
Higher Dimension
A saddle point in a higher-dimensional space still has directions along which the loss can decrease, so in higher dimensions critical points are often escapable saddle points rather than local minima.
Lesson 5: Batch and Momentum
With parallel computation (GPU), a large batch needs less time per epoch, while a small batch needs more time per epoch (more updates).
What’s wrong with large batch size? —> Optimization Fails
Small batch: when one batch gets stuck at a critical point, another batch's gradient can help overcome it.
Small batch is better on test data.
Large batch can be worse on test data even with similar training accuracy —> a kind of overfitting.
There are many differences between small batch and large batch.
Momentum
Momentum: the update direction combines the previous movement with the negative gradient, rather than following the negative gradient alone.
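A minimal sketch of the momentum update, assuming a hypothetical gradient function `grad(theta)` and illustrative hyperparameters:

```python
def momentum_step(theta, m, grad, eta=0.01, lam=0.9):
    # m: previous movement; lam: how much of it is kept (a hyperparameter)
    m_new = lam * m - eta * grad(theta)   # previous movement plus the negative gradient step
    theta_new = theta + m_new             # move along the combined direction
    return theta_new, m_new
```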
Lesson 6: Learning Rate (Error Surface Is Rugged)
Tips for training: Adaptive Learning Rate.
Training stuck != Small Gradient
In the convex error-surface example from the slides, a learning rate that is too big makes the parameters oscillate between the two walls of the valley, while one that is too small barely makes progress along the flat direction.
Learning rate adapts dynamically
RMSProp is an unpublished optimization algorithm designed for neural networks, first proposed by Geoff Hinton in Lecture 6 of his online course.
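A minimal RMSProp-style sketch: every parameter gets its own effective learning rate through a running average of squared gradients (`grad`, `alpha`, and `eps` are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def rmsprop_step(theta, sq_avg, grad, eta=1e-3, alpha=0.9, eps=1e-8):
    g = grad(theta)
    sq_avg = alpha * sq_avg + (1 - alpha) * g ** 2       # exponential moving average of g^2
    theta = theta - eta * g / (np.sqrt(sq_avg) + eps)    # per-parameter adaptive step size
    return theta, sq_avg
```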
Learning Rate Scheduling
- Decay
- Warm up
Lesson 7: Influence of the Loss Function
Mean Square Error (MSE) vs. Cross-entropy
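A minimal sketch comparing the two losses on a classification output passed through a softmax (labels assumed one-hot); the lecture's point is that cross-entropy gives a friendlier error surface for classification:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def mse_loss(logits, one_hot):
    return np.mean((softmax(logits) - one_hot) ** 2)

def cross_entropy_loss(logits, one_hot):
    return -np.sum(one_hot * np.log(softmax(logits) + 1e-12))
```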
Lesson 8: CNN (Image Classification)
Number of weights == input × kernel × output.
Identifying some critical patterns is enough to recognize an object; a neuron does not need to see the whole image.
Receptive field design
- Different neurons can have different sizes of receptive field.
- A receptive field can cover only some of the channels.
- A receptive field does not have to be square.
Typical setting of receptive fields
- Kernel size is usually 3×3.
- One receptive field is typically covered by a set of neurons (e.g., 64 neurons).
Stride
- Receptive fields overlap.
- Padding: used when a receptive field goes beyond the edge of the image.
The same patterns appear in different regions.
Does every region need its own pattern detector?
Solution to the above issue ==> parameter sharing
Neurons for different regions share the same weights.
Two neurons with the same receptive field use different weights, i.e., they do not share parameters.
Benefit of Convolutional Layer
Another good story to describe Convolutional Layer
Feature Map
A 1-channel image passed through 64 filters —> an "image" (feature map) with 64 channels; the next layer's filters are then 3×3×64, and 64 of them again give an "image" with 64 channels —> …
With deeper convolutional layers, a small 3×3 filter effectively sees a larger region of the original image.
Pooling
Max pooling: select the maximum value in each pooling region.
Pooling makes the image smaller, which reduces computation.
Pooling can sometimes hurt the result of image recognition.
The Whole CNN
Flatten: reshape the final feature map into a vector.
Then pass it through fully connected layers and a softmax.
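A minimal sketch of this whole pipeline in PyTorch (layer sizes are illustrative and assume a 3-channel 32×32 input, not numbers from the slides):

```python
import torch.nn as nn

# convolution -> pooling (repeated) -> flatten -> fully connected -> softmax
cnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),    # 64 filters of size 3x3
    nn.ReLU(),
    nn.MaxPool2d(2),                                # max pooling halves the spatial size
    nn.Conv2d(64, 64, kernel_size=3, padding=1),    # the next layer's filters are 3x3x64
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                   # feature map -> vector
    nn.Linear(64 * 8 * 8, 10),                      # 8x8 spatial size left after two poolings of 32x32
    nn.Softmax(dim=-1),
)
```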
Why CNN for Go Playing?
- Some patterns are much smaller than the whole image (board).
- The same patterns appear in different regions.
Alpha Go does not use Pooling…
CNNs are not invariant to scaling and rotation, so image-recognition work needs data augmentation.
Lesson 9: Self-Attention (Part 1)
Example input from the lecture: the tongue-twister "货拉拉拉不拉拉布拉多犬呢" ("Will Huolala haul a Labrador or not?"); each word is a vector, so the sentence is a set of vectors.
When the input is a set of vectors ( not one vector )
What is the output?
N to N (each input vector has a corresponding output)
N to one (the whole sequence has a single output)
N to M, where the model decides the output length —> seq2seq
seq2seq:
- translation
- speech recognition
Sequence Labeling: N2N
Self-attention
Attention is all you need.
The input vectors first go into a self-attention layer, which outputs a sequence of vectors that each take the whole sequence into account; these vectors are then fed into fully connected layers.
FC layers and self-attention layers can be used alternately (stacked in turn).
The input of a self-attention layer can be the original input or the output of a hidden layer.
How to calculate the attention scores α (e.g., by the dot product of query and key).
The output b will be closer to the value vector v whose attention score is the largest.
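A minimal NumPy sketch of dot-product self-attention, assuming an input matrix `X` of shape (N, d) and hypothetical weight matrices `Wq`, `Wk`, `Wv`:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # attention scores (alpha before softmax)
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)   # softmax over each row
    return alpha @ V                             # each output b is a weighted sum of the v's
```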
Lesson 10: Self-Attention (Part 2)
Multi-head Self-attention
Positional Encoding
Self-attention for Speech
Self-attention for Images
Self-attention vs. CNN
Self-attention vs. RNN
Self-attention for Graphs
Lesson 11: Transformer (Part 1)
A Transformer is a sequence-to-sequence model,
i.e., seq2seq.
That is, it takes an input sequence and outputs another sequence, and the length of the output sequence is decided by the model; it is not known in advance.
Speech synthesis: text-to-speech (TTS)
Lesson 12: Transformer (Part 2)
The encoder of a sequence-to-sequence model
The encoder takes in a sequence of vectors and outputs a sequence of vectors; this can be done with self-attention, RNN, or CNN.
Simplified view of the model, and how it is used in the Transformer.
Decoder: autoregressive decoder
The decoder takes its own output as the next input.
Comparison of the encoder and the decoder
The difference with masked self-attention:
when producing b1, it does not consider the later a2, a3, a4; the same holds when producing b2, b3, b4.
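A minimal sketch (hypothetical shapes) of how the mask is applied to the attention scores before the softmax, so that position i only attends to positions up to i:

```python
import numpy as np

def mask_scores(scores):
    # scores: (N, N) attention scores; entries above the diagonal are future positions
    N = scores.shape[0]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)   # softmax then gives future positions zero weight
```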
A special end-of-sequence token must be added so the decoder knows when to stop.
Autoregressive vs. non-autoregressive decoding
How the encoder passes information to the decoder: cross-attention
Training
Minimize the sum of the cross-entropy over all output tokens.
During training, the decoder is fed the correct (ground-truth) tokens as input, i.e., teacher forcing.
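A minimal sketch of the objective, using PyTorch's built-in cross-entropy over the decoder's output distributions (shapes are illustrative assumptions):

```python
import torch.nn.functional as F

def seq_cross_entropy(logits, target_ids):
    # logits: (T, vocab_size) decoder outputs for the T target positions (teacher-forced)
    # target_ids: (T,) ground-truth token ids, including the end-of-sequence token
    return F.cross_entropy(logits, target_ids, reduction="sum")   # total cross-entropy to minimize
```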
Tips
One such trick is to add some wrong tokens to the decoder's input during training, i.e., scheduled sampling.
GAN
We want the generated content to be as close as possible to the given (target) content.
GAN training tips
The problem with JS divergence: when the generated and real distributions do not overlap, the JS divergence is always log 2, so it gives no useful training signal.
Instead, use the Wasserstein distance, computed as the minimum cost of moving one distribution onto the other, to replace this loss (WGAN).
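For reference, the behavior that motivates the switch (a standard fact, not a formula from the slides): when $P_{data}$ and $P_G$ do not overlap, the JS divergence is stuck at a constant, whereas the Wasserstein distance still reflects how far apart the two distributions are:

$$\mathrm{JS}(P_{data} \,\|\, P_G) = \log 2 \quad \text{whenever } P_{data} \text{ and } P_G \text{ have disjoint supports}$$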