GuoXin Li's Blog

Hongyi-Lee Machine Learning Spring 2021 Mandarin

2021/03/17

Hongyi-Lee Machine Learning Spring Mandarin: Notes

Lesson 1 Basic Concepts of ML

Different Types of Functions

Regression: The function outputs a scalar

Screen Shot 2021-03-16 at 23.45.08

Classification: Given options (classes), the function outputs the correct one.

Screen Shot 2021-03-16 at 23.46.21

Screen Shot 2021-03-16 at 23.47.04

How to find the Function

Three steps

  • Function with Unknown Parameters

  • Define Loss from Training Data

    • Loss is a function of the parameters: $L(b, w)$
    • Loss: how good a set of values is.


    Loss: $L = \frac{1}{N}\sum_{n} e_{n}$

    Types of Loss

    • MAE: L is the mean absolute error, $e = |y-\hat{y}|$
    • MSE: L is the mean square error, $e = (y-\hat{y})^{2}$
    • If $y$ and $\hat{y}$ are both probability distributions -> cross-entropy


  • Optimization

    The way to find the best w and b: gradient descent

    • randomly pick an initial value $w^0$

    • compute $\frac{\partial{L}}{\partial{w}}|_{w=w^0}$

      • Negative -> increase w
      • Positive -> decrease w
      • learning rate: a hyperparameter, i.e. a value set by a human rather than learned

      Screen Shot 2021-03-17 at 22.12.31

    • update w iteratively

Disadvantage of gradient descent: it may not find the true best w* (the global minimum) and can stop at a local minimum instead.

Screen Shot 2021-03-17 at 22.21.45

The same steps are used to find b*.
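The three steps above can be sketched as follows. This is a minimal illustration, assuming the linear model y = b + w x with MSE loss. The toy data, learning rate and step count are made up for the example, and the initial values are fixed at 0 instead of being picked at random. Following the notation above, y is the prediction and y_hat is the label.

```python
def mae(ys, y_hats):
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(y - t) for y, t in zip(ys, y_hats)) / len(ys)


def mse(ys, y_hats):
    """Mean square error: average of (y - y_hat)^2."""
    return sum((y - t) ** 2 for y, t in zip(ys, y_hats)) / len(ys)


def gradient_descent(xs, y_hats, w=0.0, b=0.0, lr=0.05, steps=2000):
    """Minimize the MSE of the model y = b + w * x over w and b."""
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of L = (1/N) * sum (b + w*x - y_hat)^2
        grad_w = sum(2 * (b + w * x - t) * x for x, t in zip(xs, y_hats)) / n
        grad_b = sum(2 * (b + w * x - t) for x, t in zip(xs, y_hats)) / n
        # Negative gradient -> increase the parameter, positive -> decrease it
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y_hat = 1 + 2x (an assumption for this example)
xs = [0.0, 1.0, 2.0, 3.0]
y_hats = [1.0, 3.0, 5.0, 7.0]
w_star, b_star = gradient_descent(xs, y_hats)
predictions = [b_star + w_star * x for x in xs]
print(w_star, b_star, mse(predictions, y_hats))
```

On this convex toy problem there is only one minimum, so the local-minimum issue mentioned above does not show up.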


The three steps above together are called training.


Gradually consider more days of a cycle as input to compute the best w* (different models)


Lesson 2 Basic Concepts of ML (second)

Model Bias

Screen Shot 2021-03-18 at 22.22.58

Screen Shot 2021-03-19 at 23.51.31

Approximate a continuous curve by a piecewise linear curve.

To get a good approximation, we need sufficient pieces.

Screen Shot 2021-03-20 at 00.05.44

Use multiple line segments to fit the red curve.


New Model: more Features









Use $\theta$ to represent all unknown parameters




Loss is a function of the parameters: $L(\theta)$

Loss means how good a set of values is.


Optimization of New Model







Activation function

Hard Sigmoid = 2 ReLUs: one hard sigmoid can be written as the combination of two ReLUs.
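A small sketch of how ReLU relates to the (hard) sigmoid used for the piecewise linear pieces. It assumes a hard sigmoid that ramps from 0 to 1 on the interval [0, 1]; the function names and the example coefficients are illustrative only.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def hard_sigmoid(x):
    """Ramp that is 0 for x < 0, x on [0, 1], and 1 for x > 1.

    It is built from two ReLUs: the second one cancels the slope of
    the first once x passes 1.
    """
    return relu(x) - relu(x - 1.0)


def piecewise_linear(x, bias, pieces):
    """y = bias + sum_i c_i * hard_sigmoid(w_i * x + b_i)."""
    return bias + sum(c * hard_sigmoid(w * x + b) for c, w, b in pieces)


# Two pieces (c, w, b) chosen arbitrarily for illustration
pieces = [(2.0, 1.0, 0.0), (-1.0, 1.0, -2.0)]
print([piecewise_linear(x, 0.5, pieces) for x in (-1.0, 0.5, 1.5, 3.0)])
```

With ReLU as the activation, the same curve therefore needs twice as many units as with the hard sigmoid.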



Model complexity







Why don't we go deeper? Overfitting


Lesson 3 Task Strategy of ML

Screen Shot 2021-03-18 at 22.17.31

Screen Shot 2021-03-21 at 16.38.10

How to improve accuracy

Screen Shot 2021-03-21 at 16.39.41

Screen Shot 2021-03-21 at 16.44.55

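A 56-layer network can do everything a 20-layer network does, so if the 56-layer network has higher training loss, the problem is optimization, not overfitting.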

Screen Shot 2021-03-21 at 16.49.46

Screen Shot 2021-03-21 at 16.51.26

The way to solve overfitting

Screen Shot 2021-03-21 at 16.53.15

Constrain the model to reduce overfitting

Screen Shot 2021-03-21 at 16.54.48

Screen Shot 2021-03-21 at 16.56.10

Screen Shot 2021-03-21 at 16.58.42


Screen Shot 2021-03-21 at 17.13.34

Screen Shot 2021-03-21 at 17.24.52

A good way to improve accuracy

Split Training Set into Training Set and Validation Set
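A minimal sketch of such a split. The 80/20 ratio, the fixed seed and the function name are assumptions for the example, not something the lecture prescribes.

```python
import random


def train_val_split(samples, val_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction as the validation set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    # The first n_val shuffled samples become validation data
    return shuffled[n_val:], shuffled[:n_val]


data = list(range(10))
train_set, val_set = train_val_split(data)
print(len(train_set), len(val_set))
```

The model is chosen by its loss on the validation set, so the test set is only touched once at the end.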

Screen Shot 2021-03-21 at 18.01.28

Screen Shot 2021-03-21 at 18.03.57


Screen Shot 2021-03-21 at 18.06.38

Lesson 4 Local Minima and Saddle Points

Critical point


Use a Taylor series approximation to represent the loss function around a point $\theta'$


$(\theta-\theta')^T g$ vanishes when the loss is at a critical point, because the gradient $g$ is zero there


Eigenvalues
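For reference, the second-order Taylor approximation is $L(\theta) \approx L(\theta') + (\theta-\theta')^T g + \frac{1}{2}(\theta-\theta')^T H (\theta-\theta')$. At a critical point only the Hessian term is left, so the signs of the eigenvalues of $H$ decide the type of the point. The sketch below handles only a 2x2 symmetric Hessian. The example loss $L = (1 - w_1 w_2)^2$ is an assumption chosen for illustration.

```python
import math


def eigenvalues_2x2_symmetric(h):
    """Eigenvalues of [[a, b], [b, d]] in closed form (smaller one first)."""
    (a, b), (_, d) = h
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


def classify_critical_point(h, tol=1e-12):
    """Classify a critical point from the eigenvalues of its Hessian."""
    low, high = eigenvalues_2x2_symmetric(h)
    if low > tol:
        return "local minimum"   # all eigenvalues positive
    if high < -tol:
        return "local maximum"   # all eigenvalues negative
    if low < -tol and high > tol:
        return "saddle point"    # mixed signs
    return "inconclusive"        # some eigenvalue is (numerically) zero


# Hessian of L = (1 - w1*w2)^2 at the critical point w1 = w2 = 0:
# d2L/dw1^2 = 2*w2^2 = 0, d2L/dw2^2 = 2*w1^2 = 0, d2L/dw1dw2 = -2
hessian = [[0.0, -2.0], [-2.0, 0.0]]
print(eigenvalues_2x2_symmetric(hessian), classify_critical_point(hessian))
```

At a saddle point, moving along an eigenvector with a negative eigenvalue still decreases the loss.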






Higher Dimension

saddle point in higher dimension



Lesson 5 Batch and Momentum




With parallel computation, a large batch size needs less time per epoch, while a small batch size needs more time per epoch.



What's wrong with large batch size? -> Optimization fails


Small batch: when the loss of one batch is stuck at a critical point, the loss of the next batch is different, so the update can escape it.


Small batch is better on test data.

Large batch is worse on test data (overfitting).


Many differences on Small Batch V.S. Large Batch
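A small sketch of how batch size changes the number of parameter updates in one epoch. The dataset of 20 samples and the helper name are assumptions for the example.

```python
import random


def iterate_batches(samples, batch_size, seed=0):
    """Shuffle once per epoch, then yield consecutive batches."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


samples = list(range(20))
# One update per batch: full batch -> 1 update, batch size 1 -> 20 updates
updates_full_batch = len(list(iterate_batches(samples, batch_size=20)))
updates_small_batch = len(list(iterate_batches(samples, batch_size=1)))
print(updates_full_batch, updates_small_batch)
```

Each small-batch update sees a noisier gradient, and that noise is what helps the optimization and the test results noted above.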




Momentum: the next movement is the previous movement minus the current gradient, rather than the negative gradient alone.
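A minimal sketch of gradient descent with momentum on a one-dimensional loss. The loss $L = (\theta - 3)^2$, the learning rate and the momentum factor 0.9 are assumptions for the example.

```python
def gd_with_momentum(grad, theta, lr=0.1, momentum=0.9, steps=300):
    """Each movement = momentum * previous movement - lr * current gradient."""
    movement = 0.0
    for _ in range(steps):
        movement = momentum * movement - lr * grad(theta)
        theta += movement
    return theta


# Gradient of the toy loss L = (theta - 3)^2
theta_final = gd_with_momentum(lambda t: 2 * (t - 3.0), theta=0.0)
print(theta_final)
```

Because the previous movement is carried over, the parameter can keep moving even where the gradient is momentarily zero.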



Lesson 6 Learning Rate (Error Surface Is Rugged)

Tips for training: Adaptive Learning Rate.

Training stuck != Small Gradient


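The error surface is on the left side of the picture above.

We can see the updates oscillate between the two sides of the valley, so training is stuck even though the gradient is not small.


The learning rate is either too big or too small.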

Screen Shot 2021-03-25 at 10.32.12

Screen Shot 2021-03-25 at 10.42.42

Screen Shot 2021-03-25 at 10.43.55

Learning rate adapts dynamically

Screen Shot 2021-03-25 at 10.45.01

RMSProp is an unpublished optimization algorithm designed for neural networks, first proposed by Geoff Hinton in lecture 6 of his online course.
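A sketch of one common formulation of RMSProp for a single parameter. The hyperparameter values, the small `eps` term and the toy loss $L = (\theta - 3)^2$ are assumptions for the example.

```python
import math


def rmsprop(grad, theta, lr=0.01, alpha=0.9, eps=1e-8, steps=1000):
    """Divide the learning rate by a running root-mean-square of the gradient."""
    mean_square = None
    for _ in range(steps):
        g = grad(theta)
        if mean_square is None:
            mean_square = g * g  # first step: use the current gradient only
        else:
            # Recent gradients weigh more than old ones
            mean_square = alpha * mean_square + (1 - alpha) * g * g
        theta -= lr * g / (math.sqrt(mean_square) + eps)
    return theta


theta_rms = rmsprop(lambda t: 2 * (t - 3.0), theta=0.0)
print(theta_rms)
```

The effective step shrinks where recent gradients are large and grows where they are small, which is the dynamic adaptation described above.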

Screen Shot 2021-03-25 at 10.50.49

Screen Shot 2021-03-25 at 10.52.27

Screen Shot 2021-03-25 at 10.52.56

Screen Shot 2021-03-25 at 11.02.06

Screen Shot 2021-03-25 at 11.04.22

Learning Rate Scheduling

  • Decay
  • Warm up
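Both ideas can be combined in one schedule. The sketch below assumes a linear warm up followed by exponential decay; the shape of the curve and all constants are illustrative choices.

```python
def learning_rate(step, base_lr=0.1, warmup_steps=100, decay=0.999):
    """Linear warm up to base_lr, then exponential decay."""
    if step < warmup_steps:
        # Warm up: grow the learning rate from small to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Decay: shrink the learning rate as training goes on
    return base_lr * decay ** (step - warmup_steps)


print([round(learning_rate(s), 4) for s in (0, 50, 100, 1100)])
```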

Screen Shot 2021-03-25 at 11.05.33

Screen Shot 2021-03-25 at 11.18.04

Screen Shot 2021-03-25 at 11.19.27

Screen Shot 2021-03-25 at 11.22.35

Lesson 7 Loss Function Influence

Screen Shot 2021-03-25 at 13.01.23

Screen Shot 2021-03-25 at 13.01.44

Screen Shot 2021-03-25 at 13.05.14

Screen Shot 2021-03-25 at 13.06.17

Screen Shot 2021-03-25 at 13.07.45

Screen Shot 2021-03-25 at 13.11.46

Mean Square Error (MSE) vs. Cross-entropy
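A sketch of why the choice matters for optimization. It compares the gradient of both losses with respect to the logits when the prediction is badly wrong. The logits (-10, 10, -1000) and the one-hot target are assumed values, and the squared error is summed over classes rather than averaged.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, target):
    """Gradient of -sum_i t_i * log(p_i) w.r.t. the logits: p - t."""
    return [p - t for p, t in zip(softmax(logits), target)]


def mse_grad(logits, target):
    """Gradient of sum_i (p_i - t_i)^2 w.r.t. the logits (chain rule)."""
    probs = softmax(logits)
    weighted = [2 * (p - t) * p for p, t in zip(probs, target)]
    total = sum(weighted)
    return [w - p * total for w, p in zip(weighted, probs)]


logits = [-10.0, 10.0, -1000.0]   # model is confident in the wrong class
target = [1.0, 0.0, 0.0]
ce = cross_entropy_grad(logits, target)
se = mse_grad(logits, target)
print(ce[0], se[0])
```

With cross-entropy the gradient on the first logit stays close to -1, while with squared error it is almost zero, so gradient descent would barely move from this starting point.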

Screen Shot 2021-03-25 at 13.16.52

Lesson 8 CNN: Image Classification

Screen Shot 2021-03-26 at 21.12.20

Screen Shot 2021-03-26 at 21.23.33

Screen Shot 2021-03-26 at 21.29.28

Total number of weights = input x kernel x output.

Identify some critical patterns to recognize a certain object.


Screen Shot 2021-03-26 at 21.37.56

Screen Shot 2021-03-26 at 21.47.26

Receptive field design

  • Can different neurons have receptive fields of different sizes?
  • Can a receptive field cover only some channels?
  • Can a receptive field be non-square?

Typical Setting of Receptive Fields

  • kernel size is usually 3x3
  • each receptive field typically has a set of neurons (e.g. 64)
  • stride
  • overlap
  • padding: used when the receptive field extends beyond the image border

Screen Shot 2021-03-26 at 21.57.52

The same patterns appear in different regions.

Does every region need its own pattern detector?

Screen Shot 2021-03-26 at 21.59.40

Solution to the above issue ==> Parameter Sharing

Neurons in different regions use the same weights.

Two neurons with the same receptive field do not share parameters (they use different weights).
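Parameter sharing can be sketched as one filter sliding over the image: every receptive field is multiplied by the same weights. The example assumes a single-channel image, a 3x3 kernel, stride 1 and no padding; the diagonal pattern is made up for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Apply one shared kernel to every receptive field (no padding)."""
    k = len(kernel)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Same weights for every position (r, c)
            total = 0.0
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# A filter that detects a diagonal line, applied to a diagonal image
image = [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]
diagonal_filter = [[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]]
print(conv2d(image, diagonal_filter))
```

The filter responds strongly wherever the diagonal pattern appears, without needing a separate detector per region.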

Screen Shot 2021-03-26 at 22.03.48

Screen Shot 2021-03-26 at 22.10.09

Benefit of Convolutional Layer

Screen Shot 2021-03-26 at 22.12.56

Another good story to describe Convolutional Layer


Screen Shot 2021-03-26 at 22.16.50

Feature Map

Screen Shot 2021-03-26 at 22.17.49

1-channel image x 64 filters (each 3x3x1) -> "image" with 64 channels x filters of size 3x3x64 -> new "image" with one channel per filter -> …

With deeper convolutional layers, a 3x3 filter covers a larger region of the original image.

Screen Shot 2021-03-26 at 22.24.21

Screen Shot 2021-03-26 at 22.29.13

Screen Shot 2021-03-26 at 22.29.48

Screen Shot 2021-03-26 at 22.30.37


Screen Shot 2021-03-26 at 22.31.40

Max pooling: select the maximum value within each group of the feature map.
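A minimal sketch of max pooling, assuming non-overlapping 2x2 groups on a single-channel feature map whose sides are divisible by the group size.

```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of every non-overlapping size x size group."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 10, 13, 14],
               [11, 12, 15, 16]]
print(max_pool2d(feature_map))
```

Pooling has no parameters to learn; it only shrinks the feature map.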

Screen Shot 2021-03-26 at 22.33.07

Pooling: makes the image smaller.

Screen Shot 2021-03-26 at 22.34.44

Pooling can sometimes hurt image recognition results, because fine details are discarded.

The Whole CNN

Screen Shot 2021-03-26 at 22.36.52

Flatten: reshape the final feature-map matrix into a vector.

Then apply a softmax to get the class probabilities.

Screen Shot 2021-03-26 at 22.41.46

Why CNN for Go Playing?

  • Some patterns are much smaller than the whole image.
  • The same patterns appear in different regions.

Screen Shot 2021-03-26 at 22.43.10

Alpha Go does not use Pooling…

Screen Shot 2021-03-26 at 22.46.49

Screen Shot 2021-03-26 at 22.49.30

A CNN is not invariant to scaling and rotation, so image recognition needs data augmentation.

Screen Shot 2021-03-26 at 22.51.40

Lesson 9 Self-Attention (part 1)


When the input is a set of vectors ( not one vector )

Screen Shot 2021-03-28 at 11.01.38

Screen Shot 2021-03-28 at 11.04.11

Screen Shot 2021-03-28 at 11.05.53

Screen Shot 2021-03-28 at 11.06.52

Screen Shot 2021-03-28 at 11.07.42

What is the output?

N to N

Screen Shot 2021-03-28 at 11.09.58

N to One

Screen Shot 2021-03-28 at 11.11.38

N to N' (the model decides the output length): seq2seq


  • translation
  • speech recognition

Screen Shot 2021-03-28 at 11.12.37

Sequence Labeling: N2N

Screen Shot 2021-03-28 at 11.56.56


Attention is all you need.


Screen Shot 2021-03-28 at 12.04.56

First, the input goes into a self-attention layer, which outputs a sequence of vectors; these vectors are then fed into fully connected layers.

Fully connected layers and self-attention layers can be used alternately.

Screen Shot 2021-03-28 at 12.07.53

The input of self-attention can be either the network input or the output of a hidden layer.

Screen Shot 2021-03-28 at 12.11.06

Screen Shot 2021-03-28 at 12.14.22

How to calculate the attention score α

Screen Shot 2021-03-28 at 12.16.56

Screen Shot 2021-03-28 at 12.32.30

Screen Shot 2021-03-28 at 12.33.46

Screen Shot 2021-03-28 at 12.35.30

The output b will be closest to the value vector v whose attention score is the largest.
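A minimal sketch of one self-attention layer in plain Python, assuming dot-product attention scores followed by a softmax, without the scaling used in the Transformer paper. The identity projection matrices and the two input vectors are made up so the result is easy to read.

```python
import math


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(inputs, w_q, w_k, w_v):
    """b_i = sum_j softmax_j(q_i . k_j) * v_j for every input position i."""
    queries = [matvec(w_q, a) for a in inputs]
    keys = [matvec(w_k, a) for a in inputs]
    values = [matvec(w_v, a) for a in inputs]
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])  # attention scores
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[10.0, 0.0], [0.0, 10.0]]
outputs = self_attention(inputs, identity, identity, identity)
print(outputs)
```

Here each vector attends almost entirely to itself, so each output is nearly equal to the corresponding value vector.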

Lesson 10 Self-Attention (part 2)

Screen Shot 2021-03-28 at 12.38.21

Screen Shot 2021-03-28 at 12.41.25

Screen Shot 2021-03-28 at 12.43.08

Screen Shot 2021-03-28 at 12.45.16

Screen Shot 2021-03-28 at 12.48.52

Screen Shot 2021-03-28 at 12.50.35

Multi-head Self-attention


Screen Shot 2021-03-28 at 12.53.36

Screen Shot 2021-03-28 at 12.58.02

Positional Encoding

Screen Shot 2021-03-28 at 13.01.23

Screen Shot 2021-03-28 at 13.02.37
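The notes do not say which positional encoding is used. The sketch below assumes the hand-crafted sinusoidal encoding from "Attention is all you need", added to each input vector.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_position(vectors):
    """Add the encoding e_i of each position i to its input vector a_i."""
    return [
        [x + e for x, e in zip(vec, positional_encoding(pos, len(vec)))]
        for pos, vec in enumerate(vectors)
    ]


with_position = add_position([[1.0, 1.0], [1.0, 1.0]])
print(with_position)
```

Two identical input vectors now differ after the encoding is added, so self-attention can tell their positions apart.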

Self-attention for Speech

Screen Shot 2021-03-28 at 13.04.27

Self-attention for image

Screen Shot 2021-03-28 at 13.05.56

Screen Shot 2021-03-28 at 13.06.40

Self-attention V.S. CNN

Screen Shot 2021-03-28 at 13.08.32

Screen Shot 2021-03-28 at 13.09.27

Screen Shot 2021-03-28 at 13.12.16

Self-attention V.S. RNN


Self-attention for Graph

Screen Shot 2021-03-28 at 13.18.35

Screen Shot 2021-03-28 at 13.20.39

Lesson 11 Transformer 1

The Transformer is a sequence-to-sequence model,

i.e. Seq2seq.

That is, it takes a sequence as input and outputs a sequence; the length of the output sequence is decided by the model and is not known in advance.

Screen Shot 2021-04-22 at 10.51.15

Speech synthesis: Text to Speech (TTS)

Screen Shot 2021-04-22 at 12.24.54

Screen Shot 2021-04-22 at 12.29.06

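Lesson 12 Transformer 2

The Encoder in Sequence to Sequence

The main job of the encoder is to take a sequence of vectors as input and output another sequence of vectors. It can be built with self-attention, an RNN, or a CNN.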

Screen Shot 2021-04-22 at 12.34.14


Screen Shot 2021-04-22 at 12.36.29

In the Transformer:


Screen Shot 2021-04-22 at 12.43.20

Decoder: autoregressive decoder


The decoder takes its own output as its next input.

Screen Shot 2021-04-22 at 13.06.35

Screen Shot 2021-04-29 at 23.55.15

Comparison of Encoder and Decoder

Screen Shot 2021-04-29 at 23.55.52

The difference made by Masked self-attention:

Screen Shot 2021-04-29 at 23.58.01

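When producing b1, the later inputs a2, a3, a4 are not considered; in the same way, b2, b3 and b4 only look at the inputs up to their own position.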
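A sketch of the masking step on its own, assuming the raw attention scores are already computed as an N x N matrix (row i holds the scores of query i against every key). Positions to the right of i are left out of the softmax.

```python
import math


def causal_attention_weights(scores):
    """Softmax over positions j <= i only; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # a_1 ... a_i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)         # a_{i+1} ... a_N
        weights.append([e / total for e in exps] + masked)
    return weights


# Equal raw scores make the effect of the mask easy to see
masked_weights = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
print(masked_weights)
```

The first output can only attend to the first input, the second to the first two, and so on.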

A special END token has to be added so that the decoder can stop.
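The autoregressive loop can be sketched as below. `toy_next_token` is a hypothetical stand-in for a trained decoder, and the token names and `max_len` limit are assumptions for the example.

```python
BEGIN, END = "<BEGIN>", "<END>"


def autoregressive_decode(next_token, max_len=20):
    """Feed the tokens produced so far back in until END is produced."""
    tokens = [BEGIN]
    for _ in range(max_len):
        token = next_token(tokens)
        if token == END:      # the special symbol stops decoding
            break
        tokens.append(token)
    return tokens[1:]         # drop the BEGIN token


def toy_next_token(tokens):
    """Hypothetical decoder: replays a fixed script, one token per call."""
    script = ["machine", "learning", END]
    return script[len(tokens) - 1]


decoded = autoregressive_decode(toy_next_token)
print(decoded)
```

Without the END token the loop would only stop at `max_len`, which is why the decoder has to learn to output it.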


Screen Shot 2021-04-30 at 00.02.46

Autoregressive vs. Non-autoregressive


How the encoder and the decoder pass information


Screen Shot 2021-04-30 at 00.10.14




Minimize the sum of all the cross-entropy terms.

Screen Shot 2021-04-30 at 00.18.13

During training, the decoder is fed the correct (ground-truth) tokens as input.
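A sketch of the training loss for one output sequence, assuming the decoder has already produced a probability distribution over the vocabulary at each step while being fed the ground-truth tokens. The distributions and target indices are made-up numbers.

```python
import math


def sequence_cross_entropy(step_distributions, target_ids):
    """Sum over steps of -log(probability assigned to the correct token)."""
    return -sum(
        math.log(dist[target])
        for dist, target in zip(step_distributions, target_ids)
    )


# Two decoding steps over a vocabulary of two tokens
step_distributions = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]
loss = sequence_cross_entropy(step_distributions, target_ids)
print(loss)
```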


Screen Shot 2021-04-30 at 00.20.05

Screen Shot 2021-04-30 at 00.25.43

Screen Shot 2021-04-30 at 00.28.13


Screen Shot 2021-04-30 at 00.32.53

One way to handle this is to put some wrong tokens into the decoder's input during training, i.e. to use scheduled sampling.



Screen Shot 2022-03-31 at 20.16.22


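GAN training tips

The problem with JS divergence: as long as the generated distribution and the data distribution do not overlap, the JS divergence is always log 2.

Switching to the Wasserstein distance fixes this: it measures the minimum cost of moving one distribution onto the other, so the loss still changes as the two distributions get closer.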
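A small numeric sketch of this difference, assuming discrete distributions on four equally spaced bins. The 1-D Wasserstein distance is computed from the gap between the cumulative distributions, which is valid only in one dimension.

```python
import math


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS(p, q) = KL(p || m) / 2 + KL(q || m) / 2 with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(p, q):
    """Earth mover's distance on unit-spaced bins: sum of |CDF_p - CDF_q|."""
    distance, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        distance += abs(cdf_gap)
    return distance


p_data = [1.0, 0.0, 0.0, 0.0]
g_near = [0.0, 1.0, 0.0, 0.0]   # generator output close to the data
g_far = [0.0, 0.0, 0.0, 1.0]    # generator output far from the data
print(js_divergence(p_data, g_near), js_divergence(p_data, g_far))
print(wasserstein_1d(p_data, g_near), wasserstein_1d(p_data, g_far))
```

Both generators get the same JS divergence because neither overlaps with the data, while the Wasserstein distance is smaller for the closer one and so gives a useful training signal.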

Screen Shot 2022-03-31 at 21.34.47

