GuoXin Li's Blog

Hongyi-Lee Machine Learning Spring 2021 Mandarin

2021/03/17

Hongyi-Lee Machine Learning Spring Mandarin —Notes

Lesson 1: Basic Concepts of ML

Different Types of Functions

Regression: The function outputs a scalar

Screen Shot 2021-03-16 at 23.45.08

Classification: Given options (classes), the function outputs the correct one.

Screen Shot 2021-03-16 at 23.46.21

Screen Shot 2021-03-16 at 23.47.04

How to find the Function

Three steps

  • Function with Unknown Parameters

  • Define Loss from Training Data

    • Loss is a function of the parameters: L(b, w)
    • Loss: how good a set of values is.

    image-20210317214832354

    Loss: $L = \frac{1}{N}\sum_{n} e_{n}$

    Types of Loss

    • MAE: L is the mean absolute error (MAE), $e = |y-\hat{y}|$
    • MSE: L is the mean square error (MSE), $e = (y-\hat{y})^{2}$
    • If y and $\hat{y}$ are both probability distributions —> cross-entropy (a small numeric sketch follows this list)
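
For a concrete feel of these definitions, here is a minimal NumPy sketch (the numbers and array names are made up for illustration, not taken from the course data):

```python
import numpy as np

y     = np.array([5.2, 4.8, 6.1])   # model predictions
y_hat = np.array([5.0, 5.0, 6.0])   # ground-truth labels

# L = (1/N) * sum_n e_n, with e_n defined per loss type
mae = np.mean(np.abs(y - y_hat))    # e = |y - y_hat|
mse = np.mean((y - y_hat) ** 2)     # e = (y - y_hat)^2
```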

    image-20210317220111124

  • Optimization

    The way to find the best w and b: gradient descent (a minimal sketch follows this list)

    • randomly pick an initial value $w^0$

    • compute $\frac{\partial{L}}{\partial{w}}|_{w=w^0}$

      • Negative —> increase w
      • Positive —> decrease w
      • learning rate: a hyperparameter —> set by a human

      Screen Shot 2021-03-17 at 22.12.31

    • update w iteratively
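
Putting the three steps together, here is a minimal gradient-descent sketch for the linear model $y = b + wx$ with MSE loss (toy data and hyperparameter values are my own, not the lecture's):

```python
import numpy as np

# Toy data: x = views on day n, t = views on day n+1 (made-up numbers)
x = np.array([4.8, 5.1, 4.9, 5.3, 5.0])
t = np.array([5.1, 4.9, 5.3, 5.0, 5.2])

w, b = 0.0, 0.0          # arbitrarily chosen initial values
eta  = 0.01              # learning rate (a hyperparameter set by a human)

for step in range(1000):
    y = b + w * x                       # model prediction
    # Gradients of the MSE loss L = mean((y - t)^2) w.r.t. w and b
    grad_w = np.mean(2 * (y - t) * x)
    grad_b = np.mean(2 * (y - t))
    # Negative gradient -> increase the parameter; positive -> decrease it
    w -= eta * grad_w
    b -= eta * grad_b
```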

Disadvantage of gradient descent: it may fail to find the globally best w*; it can get stuck in a local minimum.

Screen Shot 2021-03-17 at 22.21.45

The same steps apply for finding b*.

image-20210317222343212

The procedure above is called training.

image-20210317222734598

Gradually consider more days per cycle to compute the best w* (these are different models)

image-20210317223404414

Lesson 2: Basic Concepts of ML (Part 2)

Model Bias

Screen Shot 2021-03-18 at 22.22.58

Screen Shot 2021-03-19 at 23.51.31

Approximate a continuous curve by a piecewise linear curve.

To have a good approximation, we need sufficiently many pieces.

Screen Shot 2021-03-20 at 00.05.44

Using multiple piecewise segments to fit the red curve.

image-20210320213544357

New Model: more Features

image-20210320214505891

image-20210320214758772

image-20210320214859012

image-20210320214918567

image-20210320215011501

image-20210320215055365

image-20210320215133654

image-20210320215154173

Using $\theta$ to represent all unknown parameters

image-20210320215325887

image-20210320215704989

Loss

Loss is a function of the parameters: $L(\theta)$

Loss means how good a set of values is.

image-20210320221049011

Optimization of New Model

image-20210320221457139

image-20210320221533290

image-20210320221732179

image-20210320222205683

image-20210320222432506

image-20210320222441121

Activation function

Two ReLUs can be combined to form one hard sigmoid.
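
A quick way to see this relationship (a sketch under my own scaling; the lecture's version may use different constants):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-2, 2, 9)
# 0 for x < 0, x for 0 <= x <= 1, 1 for x > 1: a hard-sigmoid ramp
hard_sigmoid = relu(x) - relu(x - 1)
```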

image-20210320222528754

image-20210320222836306

Making the model more complex

image-20210320222718395

image-20210320222803883

image-20210320223632708

image-20210320223709586

image-20210320223736747

image-20210320223801600

Why don’t we go deeper? —— Overfitting

image-20210320224030451

Lesson 3: Mission Strategy of ML

Screen Shot 2021-03-18 at 22.17.31

Screen Shot 2021-03-21 at 16.38.10

How to improve the results

Screen Shot 2021-03-21 at 16.39.41

Screen Shot 2021-03-21 at 16.44.55

A 56-layer network can do everything a 20-layer network can, so worse training loss indicates an optimization issue rather than model bias.

Screen Shot 2021-03-21 at 16.49.46

Screen Shot 2021-03-21 at 16.51.26

The way to solve overfitting

Screen Shot 2021-03-21 at 16.53.15

Use a constrained model to solve overfitting.

Screen Shot 2021-03-21 at 16.54.48

Screen Shot 2021-03-21 at 16.56.10

Screen Shot 2021-03-21 at 16.58.42

Bias-Complexity

Screen Shot 2021-03-21 at 17.13.34

Screen Shot 2021-03-21 at 17.24.52

A good way to improve the results (model selection)

Split the training set into a training set and a validation set.

Screen Shot 2021-03-21 at 18.01.28

Screen Shot 2021-03-21 at 18.03.57

Mismatch

Screen Shot 2021-03-21 at 18.06.38

Lesson 4: Local Minima and Saddle Points

Critical point

image-20210322124818350

Using a Taylor series approximation to locally represent the loss function around a point $\theta'$.

image-20210322125244588

The term $(\theta-\theta')^{T} g$ vanishes when the loss is at a critical point, because the gradient $g$ is zero there.
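
Written out, the second-order Taylor approximation used here is:

$$L(\theta) \approx L(\theta') + (\theta-\theta')^{T} g + \frac{1}{2}(\theta-\theta')^{T} H (\theta-\theta')$$

where $g$ is the gradient and $H$ the Hessian of $L$ at $\theta'$. At a critical point $g = 0$, so the local behaviour is determined entirely by the Hessian term.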

image-20210322130418466

Eigenvalues of the Hessian: all positive —> local minimum; all negative —> local maximum; some positive and some negative —> saddle point.

image-20210322130924505

image-20210322130913155

image-20210322131430124

Example

image-20210322131550636

Higher Dimension

Saddle points in higher dimensions: a critical point that looks like a local minimum in a lower-dimensional view can turn out to be a saddle point when viewed in higher dimensions.

image-20210322203114110

image-20210322203719848

Lesson 5: Batch and Momentum

image-20210322210323144

image-20210322211258840

image-20210322211559760

With GPU parallel computation, a large batch needs less time per epoch, while a small batch needs more time per epoch.

image-20210322211838918

image-20210322212553416

What’s wrong with large batch size? —> Optimization Fails

image-20210322212810837

Small batch: when one batch gets stuck at a critical point, another batch's gradient can help the update escape it.

image-20210322213134958

Small batch is better on test data.

Large batch is worse on test data —> overfitting.

image-20210322213445078

Summary: many differences between Small Batch and Large Batch

image-20210322214258723

Momentum

image-20210322214716830

With momentum, each update combines the negative gradient with the previous movement, instead of following the negative gradient alone.
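
A minimal sketch of the momentum update (variable and function names are mine, not the lecture's):

```python
def momentum_step(theta, m_prev, grad, eta=0.01, lam=0.9):
    """One gradient-descent-with-momentum update.

    m     : movement = lam * previous movement - eta * current gradient
    theta : parameters, moved by m instead of by the raw negative gradient
    """
    m = lam * m_prev - eta * grad
    return theta + m, m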

image-20210322214829989

image-20210322215300968

Lesson 6: Learning Rate (Error surface is rugged)

Tips for training: Adaptive Learning Rate.

Training stuck != Small Gradient

image-20210325094650916

The error surface is shown on the left side of the picture above.

We can see the loss oscillating between the two steep walls of the valley.

image-20210325095419288

Learning rate too big or too small.

Screen Shot 2021-03-25 at 10.32.12

Screen Shot 2021-03-25 at 10.42.42

Screen Shot 2021-03-25 at 10.43.55

Learning rate adapts dynamically

Screen Shot 2021-03-25 at 10.45.01

RMSProp is an unpublished optimization algorithm designed for neural networks, first proposed by Geoff Hinton in Lecture 6 of his online course.
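
A sketch of an RMSProp-style parameter update as described above (the hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop_step(theta, sigma2_prev, grad, eta=0.001, alpha=0.9, eps=1e-8):
    """RMSProp: keep an exponential moving average of squared gradients
    and divide the step size by its square root."""
    sigma2 = alpha * sigma2_prev + (1 - alpha) * grad ** 2
    theta = theta - eta * grad / (np.sqrt(sigma2) + eps)
    return theta, sigma2
```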

Screen Shot 2021-03-25 at 10.50.49

Screen Shot 2021-03-25 at 10.52.27

Screen Shot 2021-03-25 at 10.52.56

Screen Shot 2021-03-25 at 11.02.06

Screen Shot 2021-03-25 at 11.04.22

Learning Rate Scheduling

  • Decay
  • Warm up (a small sketch follows this list)
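
A toy scheduler combining the two ideas above (the exact shapes and constants are illustrative, not taken from the lecture):

```python
def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, decay=0.999):
    """Warm up: grow the learning rate linearly for the first warmup_steps,
    then decay it exponentially."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * decay ** (step - warmup_steps)
```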

Screen Shot 2021-03-25 at 11.05.33

Screen Shot 2021-03-25 at 11.18.04

Screen Shot 2021-03-25 at 11.19.27

Screen Shot 2021-03-25 at 11.22.35

Lesson 7: Loss Function Influence

Screen Shot 2021-03-25 at 13.01.23

Screen Shot 2021-03-25 at 13.01.44

Screen Shot 2021-03-25 at 13.05.14

Screen Shot 2021-03-25 at 13.06.17

Screen Shot 2021-03-25 at 13.07.45

Screen Shot 2021-03-25 at 13.11.46

Mean Square Error (MSE) vs. Cross-entropy
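
Writing $p$ for the softmax output and $t$ for the one-hot target (my notation, not the slides'), the two losses being compared are:

$$e_{\text{MSE}} = \sum_{i} (p_i - t_i)^2, \qquad e_{\text{CE}} = -\sum_{i} t_i \ln p_i$$

Cross-entropy gives larger gradients when the prediction is very wrong, which is why the lecture recommends it over MSE for classification.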

Screen Shot 2021-03-25 at 13.16.52

Lesson 8: CNN - Image Classification

Screen Shot 2021-03-26 at 21.12.20

Screen Shot 2021-03-26 at 21.23.33

Screen Shot 2021-03-26 at 21.29.28

Total weights = input × kernel × output.

Identifying a few critical patterns is enough to recognize a certain object.

image-20210326213233416

Screen Shot 2021-03-26 at 21.37.56

Screen Shot 2021-03-26 at 21.47.26

Receptive field design

  • Different neurons can have different sizes of receptive field.
  • A receptive field can cover only some of the channels.
  • A receptive field does not have to be square.

Typical Settings of the Receptive Field (a small helper follows this list)

  • kernel size is usually 3x3
  • each receptive field is typically covered by 64 neurons
  • stride
  • overlap: neighbouring receptive fields overlap
  • padding: used when the receptive field goes beyond the image
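
A small helper for the spatial size these settings produce (this is the standard convolution output-size formula; the function name and defaults are mine):

```python
def conv_output_size(in_size, kernel=3, stride=1, padding=1):
    """Spatial output size: floor((in + 2*pad - kernel) / stride) + 1."""
    return (in_size + 2 * padding - kernel) // stride + 1

# e.g. a 100x100 image with 3x3 kernels, stride 1 and padding 1 keeps its size:
assert conv_output_size(100) == 100
```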

Screen Shot 2021-03-26 at 21.57.52

The same patterns appear in different regions.

Does every region need its own pattern detector?

Screen Shot 2021-03-26 at 21.59.40

Solution to the above issue ==> parameter sharing

Some neurons in different regions use the same weights.

Two neurons with the same receptive field do not share parameters; they have different weights.

Screen Shot 2021-03-26 at 22.03.48

Screen Shot 2021-03-26 at 22.10.09

Benefit of Convolutional Layer

Screen Shot 2021-03-26 at 22.12.56

Another good story to describe Convolutional Layer

image-20210326221441500

Screen Shot 2021-03-26 at 22.16.50

Feature Map

Screen Shot 2021-03-26 at 22.17.49

A 1-channel input convolved with 64 filters —> an “image” with 64 channels; the next layer’s filters are 3x3x64, and with 64 of them the output is again an “image” with 64 channels —> …

As the convolutional layers go deeper, each 3x3 filter effectively sees a larger region of the original image.
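
To make the channel bookkeeping concrete, a rough parameter count under the typical 3x3, 64-filter setting (bias terms ignored; the layer sizes follow the lecture's typical example, the arithmetic is mine):

```python
# Layer 1: input has 1 channel, 64 filters of size 3x3x1
weights_layer1 = 3 * 3 * 1 * 64      # = 576

# Layer 2: its input is the 64-channel "image", so each filter is 3x3x64,
# and with 64 such filters the output is again a 64-channel "image".
weights_layer2 = 3 * 3 * 64 * 64     # = 36864
```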

Screen Shot 2021-03-26 at 22.24.21

Screen Shot 2021-03-26 at 22.29.13

Screen Shot 2021-03-26 at 22.29.48

Screen Shot 2021-03-26 at 22.30.37

Pooling

Screen Shot 2021-03-26 at 22.31.40

Max pooling: select the maximum value in each pooling window.

Screen Shot 2021-03-26 at 22.33.07

Pooling subsamples the image to make it smaller.

Screen Shot 2021-03-26 at 22.34.44

Pooling can sometimes hurt image recognition performance, because it throws away fine-grained information.

The Whole CNN

Screen Shot 2021-03-26 at 22.36.52

Flatten: reshape the resulting feature map (matrix) into a vector.

Then it goes through fully connected layers and a softmax.

Screen Shot 2021-03-26 at 22.41.46

Why CNN for playing Go?

  • Some patterns are much smaller than the whole board (image).
  • The same patterns appear in different regions.

Screen Shot 2021-03-26 at 22.43.10

AlphaGo does not use pooling…

Screen Shot 2021-03-26 at 22.46.49

Screen Shot 2021-03-26 at 22.49.30

CNN needs data augmentation for image recognition tasks (it is not inherently invariant to scaling and rotation).

Screen Shot 2021-03-26 at 22.51.40

Lesson 9: Self-Attention (Part 1)

货拉拉拉不拉拉布拉多犬呢 (a tongue-twister from the lecture: “Does Huolala haul Labradors?”)

When the input is a set of vectors ( not one vector )

Screen Shot 2021-03-28 at 11.01.38

Screen Shot 2021-03-28 at 11.04.11

Screen Shot 2021-03-28 at 11.05.53

Screen Shot 2021-03-28 at 11.06.52

Screen Shot 2021-03-28 at 11.07.42

What is the output?

N to N

Screen Shot 2021-03-28 at 11.09.58

N to One

Screen Shot 2021-03-28 at 11.11.38

N to M (the model decides the output length) —> seq2seq

seq2seq:

  • translation
  • speech recognition

Screen Shot 2021-03-28 at 11.12.37

Sequence Labeling: N2N

Screen Shot 2021-03-28 at 11.56.56

Self-attention

Attention is all you need.

image-20210328232756703

Screen Shot 2021-03-28 at 12.04.56

First, the input vectors go into a self-attention layer, which outputs a sequence of vectors; these vectors are then fed into fully connected (FC) layers.

FC layers and self-attention layers can be used alternately (interleaved).

Screen Shot 2021-03-28 at 12.07.53

The input of self-attention can be either the raw input or the output of a hidden layer.

Screen Shot 2021-03-28 at 12.11.06

Screen Shot 2021-03-28 at 12.14.22

How to calculate the attention score $\alpha$

Screen Shot 2021-03-28 at 12.16.56

Screen Shot 2021-03-28 at 12.32.30

Screen Shot 2021-03-28 at 12.33.46

Screen Shot 2021-03-28 at 12.35.30

The output $b$ will be closest to the value vector $v$ of the input whose attention weight is the largest.
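
A minimal NumPy sketch of the single-head self-attention computed in these slides (the $\sqrt{d}$ scaling follows the “Attention Is All You Need” paper; matrix names are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (N, d_in) input vectors; Wq, Wk, Wv: (d_in, d) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # attention weights alpha'
    return A @ V                                  # each output b is a weighted sum of the v's
```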

Lesson 10: Self-Attention (Part 2)

Screen Shot 2021-03-28 at 12.38.21

Screen Shot 2021-03-28 at 12.41.25

Screen Shot 2021-03-28 at 12.43.08

Screen Shot 2021-03-28 at 12.45.16

Screen Shot 2021-03-28 at 12.48.52

Screen Shot 2021-03-28 at 12.50.35

Multi-head Self-attention

image-20210328125717534

Screen Shot 2021-03-28 at 12.53.36

Screen Shot 2021-03-28 at 12.58.02

Positional Encoding

Screen Shot 2021-03-28 at 13.01.23

Screen Shot 2021-03-28 at 13.02.37
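
For reference, the hand-crafted sinusoidal positional encoding used in the original Transformer paper (one possible choice; it can also be learned from data) is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$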

Self-attention for Speech

Screen Shot 2021-03-28 at 13.04.27

Self-attention for image

Screen Shot 2021-03-28 at 13.05.56

Screen Shot 2021-03-28 at 13.06.40

Self-attention V.S. CNN

Screen Shot 2021-03-28 at 13.08.32

Screen Shot 2021-03-28 at 13.09.27

Screen Shot 2021-03-28 at 13.12.16

Self-attention V.S. RNN

image-20210328131639631

Self-attention for Graph

Screen Shot 2021-03-28 at 13.18.35

Screen Shot 2021-03-28 at 13.20.39

Lesson 11: Transformer (Part 1)

The transformer is a sequence-to-sequence model, i.e., Seq2seq.

That is, it takes a sequence as input and outputs another sequence, and the length of the output sequence is determined by the model: it is not known in advance.

Screen Shot 2021-04-22 at 10.51.15

Speech synthesis: Text to Speech (TTS)

Screen Shot 2021-04-22 at 12.24.54

Screen Shot 2021-04-22 at 12.29.06

Lesson 12: Transformer (Part 2)

The Encoder of a Sequence-to-Sequence Model

The encoder's job is to take a sequence of vectors as input and output another sequence of vectors. It can be implemented with self-attention, RNN, or CNN.

Screen Shot 2021-04-22 at 12.34.14

A simplified model:

Screen Shot 2021-04-22 at 12.36.29

In the transformer:

image-20210422124156322

Screen Shot 2021-04-22 at 12.43.20

Decoder: Autoregressive Decoder

image-20210422130533582

The decoder takes its own output as the next input.

Screen Shot 2021-04-22 at 13.06.35

Screen Shot 2021-04-29 at 23.55.15

Comparing the Encoder and the Decoder

Screen Shot 2021-04-29 at 23.55.52

The difference with masked self-attention:

Screen Shot 2021-04-29 at 23.58.01

When producing b1, the later inputs a2, a3, a4 are not considered; the same holds when producing b2, b3, b4.
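
A sketch of how this masking is commonly implemented: attention scores for future positions are set to $-\infty$ before the softmax, so their weights become zero (illustrative code, not taken from the lecture):

```python
import numpy as np

def masked_scores(scores):
    """scores: (N, N) attention scores; position i may only attend to j <= i."""
    N = scores.shape[0]
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # strictly upper triangle = future positions
    return np.where(mask, -np.inf, scores)
```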

A special end-of-sequence token must be added so that the decoder knows when to stop.

image-20210430000228745

Screen Shot 2021-04-30 at 00.02.46

Autoregressive vs. Non-autoregressive

image-20210430000657115

How the encoder passes information to the decoder

image-20210430000825225

Screen Shot 2021-04-30 at 00.10.14

Training

image-20210430001619998

image-20210430001657890

Minimize the sum of all the cross-entropy losses.

Screen Shot 2021-04-30 at 00.18.13

During training, the decoder is fed the ground-truth tokens as input (teacher forcing).

Tips

Screen Shot 2021-04-30 at 00.20.05

Screen Shot 2021-04-30 at 00.25.43

Screen Shot 2021-04-30 at 00.28.13

image-20210430003215351

Screen Shot 2021-04-30 at 00.32.53

One approach is to add some incorrect tokens to the decoder's input during training, i.e., scheduled sampling.

image-20210430003349640

GAN

Screen Shot 2022-03-31 at 20.16.22

We want the generated content to be as close as possible to the given (real) content.

GAN training tips

The problem with JS divergence: as long as the generated distribution and the real distribution do not overlap, the divergence is always log 2.

Switching to the Wasserstein distance, which is computed as the minimum cost of transporting one distribution onto the other, fixes this way of measuring the loss.
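
In symbols (standard definitions, not copied from the slides): when the real distribution $P_{data}$ and the generated distribution $P_G$ have no overlap, $JS(P_{data}, P_G) = \log 2$ no matter how far apart they are, whereas the Wasserstein distance

$$W(P_{data}, P_G) = \inf_{\gamma \in \Pi(P_{data}, P_G)} \mathbb{E}_{(x,y)\sim\gamma}\big[\,\lVert x-y\rVert\,\big]$$

keeps shrinking as $P_G$ moves toward $P_{data}$, so it still provides a useful training signal.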

Screen Shot 2022-03-31 at 21.34.47

image-20220331213528261
