This post draws on Professor Sungjoon Choi's Bayesian Deep Learning course and Yarin Gal's papers, and was written to organize the author's own understanding.
📟 Bayesian Neural Network
- Replace the deterministic network's weight parameters with distributions over these parameters
- Average over all possible weights (referred to as marginalisation)
Given a training dataset $D = (X, Y) = \{(x_i, y_i)\}_{i=1}^{N}$, we would like to estimate a function $y = f(x)$ that is likely to have generated our observations.
\subsection{Bayesian methods}
\subsubsection{Gaussian Process}
The Gaussian Process (GP) is a powerful tool in statistics that allows us to model distributions over functions. The Gaussian process offers desirable properties such as uncertainty estimates over the function values, robustness to over-fitting, and principled ways for hyper-parameter tuning. The use of approximate variational inference for the model allows us to scale it to large data via stochastic and distributed inference \cite{Gal2015DropoutC}.
Gaussian processes place a prior distribution over the space of functions, $p(f)$. This distribution represents our prior belief as to which functions are more or less likely to have generated the dataset. The posterior distribution over the space of functions given the training dataset $(X, Y)$ follows from Bayes' rule:

$$p(f \mid X, Y) = \frac{p(Y \mid X, f)\, p(f)}{p(Y \mid X)}$$
This distribution captures the most likely functions given the training dataset. By modeling the distribution over the space of functions with a Gaussian process, we can evaluate its corresponding posterior analytically. To model the data we have to choose a covariance function for the Gaussian distribution. This function defines the (scalar) similarity between every pair of input points, $K(x_i, x_j)$. Given a training dataset of size $N$, this function induces an $N \times N$ covariance matrix. Evaluating the Gaussian distribution involves inverting this $N \times N$ matrix, an operation that requires $O(N^3)$ time complexity. Many approximations to the Gaussian process have been proposed to bring this down to a manageable complexity. Bayesian Neural Networks address this problem with variational inference (VI).
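To make the $O(N^3)$ cost concrete, here is a minimal NumPy sketch of the exact GP posterior with a squared-exponential covariance function; the kernel choice, noise level, and toy data are illustrative assumptions made for this example, not something prescribed by the text.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance K(x_i, x_j)."""
    sqdist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_posterior(X, Y, X_star, noise=1e-2):
    """Exact GP posterior mean/variance; the N x N inversion is the O(N^3) step."""
    K = rbf_kernel(X, X)               # N x N covariance matrix over the train inputs
    K_s = rbf_kernel(X, X_star)        # N x M cross-covariance
    K_ss = rbf_kernel(X_star, X_star)  # M x M test covariance
    K_inv = np.linalg.inv(K + noise * np.eye(len(X)))  # O(N^3)
    mean = K_s.T @ K_inv @ Y
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.diag(cov)

# toy usage: noisy sine observations
X = np.linspace(0, 5, 20)
Y = np.sin(X) + 0.1 * np.random.randn(20)
mu, var = gp_posterior(X, Y, np.linspace(0, 5, 100))
```

The returned variance is the per-point uncertainty estimate mentioned above; doubling $N$ roughly multiplies the cost of the inversion by eight.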
\subsubsection{Bayesian Neural Network}
Bayesian Neural Networks provide a principled mathematical framework in which the weights are random variables: $W_i$ is the NN's weight matrix of dimensions $K_i \times K_{i-1}$, and $w = \{W_i\}_{i=1}^{L}$ is the set of random variables for a neural network model with $L$ layers. The predictive distribution for a new input point $x^*$ is obtained by marginalising over the weights:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw$$
Bayesian inference is used to compute a posterior over the weights, $p(w \mid D)$. However, exact Bayesian inference is computationally intractable for neural networks: since $p(w \mid D) = p(D \mid w)\, p(w) / p(D)$, the evidence $p(D)$ usually cannot be evaluated analytically. Instead of computing the posterior distribution exactly, we can use variational inference \cite{hintonDescriptionlength} to approximate the (intractable) posterior $p(w \mid D)$ with a (tractable) variational posterior $q_\theta(w)$, obtained by minimizing the KL divergence between the two.
The predictive distribution can then be approximated by replacing the true posterior with the variational posterior $q_\theta(w)$ obtained from this KL-divergence minimization.
The level of similarity between the two distributions is measured by the KL divergence:

$$\mathrm{KL}\big(q_\theta(w)\,\|\,p(w \mid D)\big) = \int q_\theta(w) \log \frac{q_\theta(w)}{p(w \mid D)}\, dw$$
Minimizing this KL divergence is equivalent to maximizing the ELBO (Evidence Lower BOund), which contains an integral with respect to the variational distribution $q_\theta(w)$:

$$\mathcal{L}_{\mathrm{VI}}(\theta) = \int q_\theta(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q_\theta(w)\,\|\,p(w)\big)$$
Maximizing the ELBO results in a variational distribution $q_\theta(w)$ that explains the data well (through the expected log-likelihood term) while staying close to the prior (through the KL term).
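As a small illustration of this trade-off, the NumPy sketch below estimates the expected log-likelihood term by Monte Carlo sampling from a mean-field Gaussian $q_\theta(w)$ over a single weight and adds the closed-form Gaussian KL term. The toy linear model, the standard-normal prior, and the noise level are assumptions made purely for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: y = 2x + noise; a single-weight "network" for illustration
X = rng.normal(size=50)
Y = 2.0 * X + 0.1 * rng.normal(size=50)

def log_likelihood(w, noise_std=0.1):
    """log p(Y | X, w) under a Gaussian observation model."""
    resid = Y - w * X
    return np.sum(-0.5 * (resid / noise_std) ** 2
                  - np.log(noise_std * np.sqrt(2 * np.pi)))

def kl_gaussian(mu_q, std_q, mu_p=0.0, std_p=1.0):
    """Closed-form KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) )."""
    return (np.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2) - 0.5)

def elbo_estimate(mu_q, std_q, n_samples=100):
    """Monte Carlo estimate of E_q[log p(Y|X,w)] - KL(q || p)."""
    w_samples = mu_q + std_q * rng.normal(size=n_samples)  # reparameterized draws
    expected_ll = np.mean([log_likelihood(w) for w in w_samples])
    return expected_ll - kl_gaussian(mu_q, std_q)

print(elbo_estimate(mu_q=2.0, std_q=0.05))  # q near the true weight -> high ELBO
print(elbo_estimate(mu_q=0.0, std_q=1.0))   # prior-like q           -> low ELBO
```

A $q_\theta(w)$ concentrated near the data-generating weight attains a much higher ELBO than one that ignores the data, which is exactly the behaviour the maximization exploits.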
The optimal variational distribution then gives the approximate predictive distribution

$$q^*_\theta(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, q^*_\theta(w)\, dw,$$

where $q^*_\theta(w)$ is the optimum of the ELBO. To sum up, for Bayesian Neural Networks there are many methods of approximating the posterior distribution $p(w \mid D)$, which can then be used to approximate the predictive distribution $p(y^* \mid x^*, D)$. Within variational inference, the required expectations can be estimated in several ways, for example Monte Carlo sampling, SGD, or the EM algorithm; MC dropout uses Monte Carlo sampling.
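For completeness, here is a minimal sketch of how MC dropout approximates the predictive distribution: dropout is kept active at test time and $T$ stochastic forward passes are averaged. The two-layer network, dropout rate, and random (untrained) weights are purely illustrative assumptions to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, W2, p_drop=0.5):
    """One stochastic forward pass of a 2-layer net with dropout kept on."""
    h = np.maximum(0.0, x @ W1)          # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop  # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)        # inverted-dropout scaling
    return h @ W2

def mc_dropout_predict(x, W1, W2, T=100):
    """Approximate p(y*|x*, D): mean and variance over T dropout samples."""
    preds = np.stack([dropout_forward(x, W1, W2) for _ in range(T)])
    return preds.mean(axis=0), preds.var(axis=0)  # predictive mean, uncertainty

# illustrative (untrained) weights just to show the sampling loop
x_star = rng.normal(size=(1, 4))
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))
mean, var = mc_dropout_predict(x_star, W1, W2)
```

Each dropout mask corresponds to one sample of the weights from the (implicit) variational posterior, so the sample mean and variance play the role of the marginalised predictive mean and its uncertainty.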