
Hongyi-Lee Machine Learning Spring 2021 Mandarin: Notes

Lesson 1 Basic Concepts of ML

Different Types of Functions

Regression: The function outputs a scalar

Classification: Given options (classes), the function outputs the correct one.

How to find the Function

Three steps

  • Function with Unknown Parameters

  • Define Loss from Training Data

    • Loss is a function of the parameters: L(b, w)
    • Loss measures how good a set of values is.

    Loss: $L = \frac{1}{N}\sum_{n} e_{n}$

    Types of Loss

    • MAE: L is the mean absolute error, $e = |y - \hat{y}|$
    • MSE: L is the mean squared error, $e = (y - \hat{y})^2$
    • If $y$ and $\hat{y}$ are both probability distributions —> cross-entropy

  • Optimization

    the way to find the best w and b: gradient descent

    • (randomly) pick an initial value $w^0$

    • compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$

      • Negative —> increase w
      • Positive —> decrease w
      • learning rate η: a hyperparameter —> set by a human

    • update w iteratively

disadvantage of gradient descent: it may not find the globally best w* —> it can get stuck at a local minimum

The same steps are used to find b*.
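
A minimal runnable sketch of the three steps for the linear model y = b + w·x (the data points, learning rate, and step count below are toy values, not from the lecture):

```python
import random

# Toy dataset (made up): x = today's views, y = tomorrow's views.
xs = [400.0, 520.0, 610.0, 480.0, 700.0]
ys = [420.0, 540.0, 600.0, 500.0, 690.0]

def loss(w, b):
    # Step 2: MSE loss, L = (1/N) * sum of e_n with e = (y - y_hat)^2.
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b):
    # Partial derivatives of the MSE loss with respect to w and b.
    n = len(xs)
    dw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
    return dw, db

w, b = random.random(), 0.0    # Step 1: pick initial values (randomly here).
eta = 1e-7                     # Learning rate: the hyperparameter set by a human.
for step in range(1000):       # Step 3: update w and b iteratively.
    dw, db = gradients(w, b)
    w, b = w - eta * dw, b - eta * db

print(w, b, loss(w, b))
```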

The procedure above is called training.

Gradually considering more previous days as input features gives different models and a better w*.

Lesson 2 Basic Concepts of ML (Part 2)

Model Bias

Approximate a continuous curve by a piecewise linear curve.

To have a good approximation, we need sufficiently many pieces.

using multiple simple curves (sigmoids) to fit the red curve
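
A minimal sketch of this idea (all constants are illustrative): the target curve is modeled as a constant plus a sum of sigmoids, y = b + Σᵢ cᵢ · sigmoid(bᵢ + wᵢ·x).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, b, cs, bs, ws):
    # y = b + sum_i c_i * sigmoid(b_i + w_i * x)
    return b + sum(c * sigmoid(bi + wi * x) for c, bi, wi in zip(cs, bs, ws))

# With enough sigmoids (pieces), this family can trace any reasonable curve.
print(model(0.5, b=0.0, cs=[1.0, -0.5, 2.0], bs=[0.0, 1.0, -2.0], ws=[1.0, 3.0, 5.0]))
```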

New Model: More Features

using θ to represent all unknown parameters

Loss

Loss is a function of the parameters: L(θ)

Loss means how good a set of values is.

Optimization of New Model

Activation function

ReLU: stacking two ReLUs gives one hard sigmoid.
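
A quick numerical check of this relation (the ramp between -1 and 1 is an illustrative choice of hard sigmoid):

```python
def relu(z):
    return max(0.0, z)

def hard_sigmoid(z):
    # A ramp that is 0 below z = -1, 1 above z = 1, and linear in between.
    return min(1.0, max(0.0, (z + 1) / 2))

# Two ReLUs compose one hard sigmoid: hs(z) == (relu(z + 1) - relu(z - 1)) / 2.
for z in [-3.0, -1.0, -0.2, 0.0, 0.7, 1.0, 4.0]:
    assert abs(hard_sigmoid(z) - (relu(z + 1) - relu(z - 1)) / 2) < 1e-12
```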

More complex models

Why don't we go deeper? —> Overfitting

Lesson 3 Task Strategy of ML

how to improve the results

A 56-layer network can do anything a 20-layer one can, so its worse training loss indicates an optimization problem, not model bias.

Ways to solve overfitting

a constrained model can solve overfitting

Bias-Complexity Trade-off

a good way to improve the results

Split Training Set into Training Set and Validation Set

Mismatch

Lesson 4 Local Minima and Saddle Points

Critical point

using a Taylor series approximation to represent the loss function locally

The linear term $(\theta - \theta')^T g$ vanishes when the loss is at a critical point, since the gradient $g$ is zero there.
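
Written out in full, the approximation around a point $\theta'$ is:

```latex
L(\theta) \approx L(\theta')
  + (\theta - \theta')^{T} g
  + \tfrac{1}{2}\,(\theta - \theta')^{T} H \,(\theta - \theta'),
\qquad
g = \nabla L(\theta'),\quad
H_{ij} = \frac{\partial^{2} L(\theta')}{\partial \theta_i\, \partial \theta_j}
```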

eigenvalues of the Hessian H: all positive —> local minimum; all negative —> local maximum; mixed signs —> saddle point

Example

Higher Dimension

in higher dimensions, a critical point is more likely to be a saddle point than a local minimum

Lesson 5 Batch and Momentum

With parallel computing, a large batch needs less time per epoch (fewer updates), while a small batch needs more time per epoch.
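
A toy sketch of the bookkeeping (numbers are made up): the updates per epoch equal the number of batches, while with GPU parallelism the time per batch barely grows until the batch is very large.

```python
import random

data = list(range(20))                         # 20 hypothetical examples

def updates_per_epoch(batch_size):
    random.shuffle(data)                       # reshuffle before each epoch
    batches = [data[i:i + batch_size]
               for i in range(0, len(data), batch_size)]
    return len(batches)                        # one parameter update per batch

print(updates_per_epoch(1))    # small batch: 20 updates per epoch
print(updates_per_epoch(20))   # large batch: 1 update per epoch
```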

What's wrong with a large batch size? —> optimization fails

Small batch: when one batch gets stuck at a critical point, another batch, whose loss surface is slightly different, can escape it.

Small batch tends to be better on testing data.

Large batch can be worse on testing data even when training accuracy is similar —> overfitting.

Summary of differences: Small Batch vs. Large Batch

Momentum

Momentum: the update direction combines the previous movement with the negative gradient, rather than following the negative gradient alone.
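
A minimal sketch of the rule m ← λ·m − η·g, θ ← θ + m (the constants and gradient sequence are illustrative):

```python
def momentum_updates(grads, eta=0.1, lam=0.9):
    theta, m = 0.0, 0.0
    for g in grads:
        m = lam * m - eta * g    # combine previous movement with the new gradient
        theta = theta + m
        yield theta

# Even when the gradient hits zero (a flat spot), the movement carries on.
print(list(momentum_updates([1.0, 1.0, 0.0, 0.0])))
```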

Lesson 6 Learning Rate (the error surface is rugged)

Tips for training: Adaptive Learning Rate.

Training stuck != Small Gradient

The error surface (left side of the figure in the lecture) shows the loss oscillating between the two steep walls of a valley, so training is stuck even though the gradient is still large.

Learning rate too big or too small.

Learning rate adapts dynamically

RMSProp is an unpublished optimization algorithm for neural networks, first proposed by Geoffrey Hinton in Lecture 6 of his online course.
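
A sketch of the RMSProp update (the small eps guard is my addition, a common safeguard against division by zero):

```python
import math

def rmsprop_step(theta, g, sigma2, eta=0.01, alpha=0.99, eps=1e-8):
    # Exponential moving average of squared gradients ...
    sigma2 = alpha * sigma2 + (1 - alpha) * g * g
    # ... scales the step: recent large gradients shrink the effective step size.
    theta = theta - eta * g / (math.sqrt(sigma2) + eps)
    return theta, sigma2

theta, sigma2 = 1.0, 0.0
for g in [0.5, 0.4, 0.45, 0.5]:    # hypothetical gradient sequence
    theta, sigma2 = rmsprop_step(theta, g, sigma2)
print(theta)
```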

Learning Rate Scheduling

  • Decay: gradually reduce the learning rate over training
  • Warm up: increase the learning rate first, then decay it (see the sketch below)
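
A toy schedule combining both ideas (all constants are made up):

```python
def lr_schedule(step, base_eta=0.1, warmup_steps=100, decay=0.99):
    if step < warmup_steps:
        # Warm up: ramp the learning rate up linearly from ~0.
        return base_eta * (step + 1) / warmup_steps
    # Decay: shrink it as training converges.
    return base_eta * decay ** (step - warmup_steps)

print(lr_schedule(0), lr_schedule(99), lr_schedule(199))
```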

Lesson 7 Loss Function Influence

Mean Square Error (MSE) vs. Cross-entropy
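
A small experiment on why cross-entropy suits classification: with a badly wrong prediction, MSE is bounded and nearly flat (tiny gradient), while cross-entropy stays large and keeps pushing. The logits here are arbitrary:

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]   # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

y_hat = softmax([-10.0, 10.0, 0.0])        # model confidently predicts class 1
y = [1.0, 0.0, 0.0]                        # but the true class is 0

mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
cross_entropy = -sum(a * math.log(b + 1e-12) for a, b in zip(y, y_hat))
print(mse, cross_entropy)                  # ~0.67 versus ~20
```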

Lesson 8 CNN: Image Classification

Total weights in a fully connected layer = input size × number of neurons, which is far too many for images.

Identifying a few critical patterns is enough to recognize an object.

Receptive field design

  • different neurons can have receptive fields of different sizes
  • a receptive field can cover only some of the channels
  • a receptive field need not be square

Typical Setting of Receptive Fields

  • kernel size is usually 3×3
  • each receptive field typically has a set of (e.g., 64) neurons
  • receptive fields are shifted by a stride, so they overlap
  • padding fills in values when a receptive field goes beyond the image (a quick parameter count follows below)
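
A quick parameter count under these settings, using the lecture's running example of a 100×100×3 image (the helper functions are mine):

```python
def conv_params(in_channels, num_filters, k):
    # Each filter is k x k x in_channels, plus one bias per filter.
    return num_filters * (k * k * in_channels + 1)

def fc_params(in_features, out_features):
    return out_features * (in_features + 1)

print(conv_params(3, 64, 3))           # 1,792 parameters for the conv layer
print(fc_params(100 * 100 * 3, 64))    # 1,920,064 for a fully connected layer
```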

The same patterns appear in different regions.

does every region need its own pattern detector?

Solution to the above issue ==> Parameter Sharing

neurons covering different regions use the same weights

two neurons with the same receptive field do not share parameters (otherwise their outputs would always be identical)

Benefit of Convolutional Layer

Another good story to describe the convolutional layer

Feature Map

A 1-channel input passed through 64 filters becomes an "image" with 64 channels; the next layer's filters must then be 3×3×64, producing another 64-channel "image", and so on as the convolutional layers get deeper.
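
A sketch of that channel bookkeeping (the input size is illustrative):

```python
def conv2d_shape(h, w, c, num_filters, k=3, stride=1, padding=0):
    # Each filter has shape k x k x c; output channels = number of filters.
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    print(f"filters: {num_filters} of {k}x{k}x{c} -> output {out_h}x{out_w}x{num_filters}")
    return out_h, out_w, num_filters

shape = (28, 28, 1)                 # hypothetical grayscale input
for _ in range(3):                  # three conv layers of 64 filters each
    shape = conv2d_shape(*shape, num_filters=64)
```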

Pooling

Max pooling: select the maximum value within each pooling window.

Pooling makes the image smaller, which reduces the amount of computation.
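
A minimal numpy sketch of 2×2 max pooling (the window size is the usual choice, but still a choice):

```python
import numpy as np

def max_pool_2x2(img):
    # Split into non-overlapping 2x2 windows and keep the max of each.
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(16).reshape(4, 4)
print(max_pool_2x2(img))    # 4x4 -> 2x2: [[5, 7], [13, 15]]
```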

Pooling can sometimes damage the result of image recognition, since fine details are lost.

The Whole CNN

Flatten: reshape the final feature-map matrix into a vector.

Then pass it through fully connected layers and a softmax.
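
A sketch of this final flatten, fully connected, softmax stage (shapes and weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((5, 5, 64))   # pretend output of conv/pooling

flat = feature_map.reshape(-1)                  # flatten: 5*5*64 -> 1600 vector
W, b = rng.standard_normal((10, flat.size)), np.zeros(10)
logits = W @ flat + b                           # one fully connected layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over 10 classes
print(probs.argmax(), probs.sum())              # predicted class, total prob 1.0
```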

Why CNN for playing Go?

  • some patterns are much smaller than the whole image
  • the same patterns appear in different regions

AlphaGo does not use pooling…

CNN is not invariant to scaling or rotation, so image recognition needs data augmentation.

Lesson 9 Self-Attention (Part 1)

A Mandarin tongue-twister from the lecture, 货拉拉拉不拉拉布拉多犬呢 ("Will Huolala haul a Labrador?"): the repeated 拉 sounds can only be disambiguated from context.

When the input is a set of vectors (not one vector)

What is the output?

N to N

N to One

N to M —> seq2seq (the model decides the output length)

seq2seq:

  • translation
  • speech recognition

Sequence Labeling: N2N

Self-attention

Attention is all you need.

First, the inputs go through a self-attention layer, which outputs a sequence of vectors; those vectors then go into fully connected (FC) layers.

Self-attention layers and FC layers can be used alternately.

The input to self-attention can be either the raw input or the output of a hidden layer.

how to calculate the attention score α

The output b will be most similar to the value vector v whose attention score is the largest.
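
A compact numpy sketch of one self-attention layer (dimensions are arbitrary; the 1/√d scaling comes from the Transformer paper rather than the lecture's basic formulation):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X holds n input vectors of dimension d, one per row.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])         # attention scores (alpha)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)      # softmax over each row
    return alpha @ V                               # each b_i is a weighted sum of the v's

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                    # 4 input vectors, d = 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (4, 8): one output per input
```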

Lesson 10 Self-Attention (Part 2)

Multi-head Self-attention

Positional Encoding

Self-attention for Speech

Self-attention for image

Self-attention vs. CNN

Self-attention vs. RNN

Self-attention for Graph

Lesson 11 Transformer (Part 1)

The Transformer is a sequence-to-sequence (Seq2seq) model: after taking in an input sequence, it outputs another sequence whose length is decided by the model itself, not known in advance.

Speech synthesis: Text to Speech (TTS)

Lesson 12 Transformer (Part 2)

The Encoder of the sequence-to-sequence model

The encoder's job is to take in a sequence of vectors and output another sequence of vectors; self-attention, RNN, or CNN can all do this.

A simplified model:

In the Transformer:

Decoder: the autoregressive decoder

the decoder takes its own output as the next input

Comparison of the Encoder and the Decoder

The difference made by masked self-attention:

when producing b1, do not consider the later a2, a3, a4; likewise for b2, b3, b4
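
A sketch of the mask that enforces this (numpy, toy scores):

```python
import numpy as np

n = 4
mask = np.triu(np.ones((n, n)), k=1).astype(bool)   # True strictly above the diagonal

scores = np.random.default_rng(0).standard_normal((n, n))
scores[mask] = -np.inf              # position i may not attend to positions j > i

alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
print(np.round(alpha, 2))           # the upper triangle is all zeros
```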

a special end token must be added so the decoder knows when to stop

Autoregressive (AT) vs. non-autoregressive (NAT)

how the encoder passes information to the decoder (cross attention)

Training

Minimize the sum of the cross-entropy over all output tokens.

During training, the decoder is fed the correct (ground-truth) tokens as input, i.e. teacher forcing.

Tips

One approach is to add some erroneous tokens to the decoder's inputs during training, i.e. scheduled sampling.

GAN

We want the generated content to be as close as possible to the given content.

GAN training tips

The problem with JS divergence: as long as the generator and data distributions do not overlap, its value is always log 2, so it gives no useful gradient.

Replacing it with the Wasserstein distance, which measures the minimum cost of moving one distribution onto the other, fixes this loss.
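
For reference, the log 2 value follows directly from the definition of JS divergence (a standard identity, not from the slides): when P and Q have disjoint supports, their mixture M = (P+Q)/2 equals P/2 on P's support, so

```latex
\mathrm{JS}(P \,\|\, Q)
 = \tfrac{1}{2}\,\mathrm{KL}\!\left(P \,\|\, M\right)
 + \tfrac{1}{2}\,\mathrm{KL}\!\left(Q \,\|\, M\right)
 = \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log 2
 = \log 2 .
```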
