This post draws on Professor Sungjoon Choi's Bayesian Deep Learning course and Yarin Gal's papers, and was written to organize the author's own understanding.
📟 Bayesian Neural Network
- Replace the deterministic network's weight parameters with distributions over these parameters
- Average over all possible weights (referred to as marginalisation)
Given a training dataset $D = (X, Y) = \{(x_i, y_i)\}_{i=1}^{N}$, we would like to estimate a function $y = f(x)$ that is likely to have generated our observations.
\subsection{Bayesian methods}
\subsubsection{Gaussian Process}
The Gaussian Process (GP) is a powerful tool in statistics that allows us to model distributions over functions. The Gaussian process offers desirable properties such as uncertainty estimates over the function values, robustness to over-fitting, and principled ways for hyper-parameter tuning. The use of approximate variational inference for the model allows us to scale it to large data via stochastic and distributed inference \cite{Gal2015DropoutC}.
Gaussian processes place a prior distribution over the space of functions, $p(f)$. This distribution represents our prior belief as to which functions are more or less likely to have generated the dataset. The posterior distribution over the space of functions given the training dataset $(X, Y)$ follows from Bayes' rule:

$$p(f \mid X, Y) = \frac{p(Y \mid X, f)\, p(f)}{p(Y \mid X)}$$
This distribution captures the most likely functions given the training dataset. By modeling the distribution over the space of functions with a Gaussian process, we can evaluate its corresponding posterior analytically. To model the data we have to choose a covariance function for the Gaussian distribution. This function defines the (scalar) similarity between every pair of input points, $K(x_i, x_j)$. Given a training dataset of size $N$, this function induces an $N \times N$ covariance matrix. Evaluating the Gaussian distribution involves inverting this $N \times N$ matrix, an operation that requires $O(N^3)$ time complexity. Many approximations to the Gaussian process have been proposed to bring this down to a manageable complexity. Bayesian Neural Networks address this problem with variational inference (VI).
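To make the $O(N^3)$ cost concrete, here is a minimal NumPy sketch of the exact GP posterior with a squared-exponential covariance function; the kernel choice, noise level, and toy data are illustrative assumptions made for this example, not something prescribed by the text.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance K(x_i, x_j)."""
    sqdist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_posterior(X, Y, X_star, noise=1e-2):
    """Exact GP posterior mean/variance; the N x N inversion is the O(N^3) step."""
    K = rbf_kernel(X, X)               # N x N covariance matrix over the train inputs
    K_s = rbf_kernel(X, X_star)        # N x M cross-covariance
    K_ss = rbf_kernel(X_star, X_star)  # M x M test covariance
    K_inv = np.linalg.inv(K + noise * np.eye(len(X)))  # O(N^3)
    mean = K_s.T @ K_inv @ Y
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.diag(cov)

# toy usage: noisy sine observations
X = np.linspace(0, 5, 20)
Y = np.sin(X) + 0.1 * np.random.randn(20)
mu, var = gp_posterior(X, Y, np.linspace(0, 5, 100))
```

The returned variance is the per-point uncertainty estimate mentioned above; doubling $N$ roughly multiplies the cost of the inversion by eight.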
\subsubsection{Bayesian Neural Network}
Bayesian Neural Networks provide a principled mathematical framework in which the weights are random variables: $W_i$ is the NN's weight matrix of dimensions $K_i \times K_{i-1}$, and $w = \{W_i\}_{i=1}^{L}$ is the set of random variables for a neural network model with $L$ layers. The predictive distribution for a new input point $x^*$ is obtained by marginalising over the weights:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw$$
Bayesian inference is used to compute a posterior over the weights, $p(w \mid D)$. However, exact Bayesian inference is computationally intractable for neural networks: since $p(w \mid D) = p(D \mid w)\, p(w) / p(D)$, the evidence $p(D)$ usually cannot be evaluated analytically. Instead of computing the posterior distribution exactly, we can use variational inference \cite{hintonDescriptionlength} to approximate the (intractable) posterior $p(w \mid D)$ with a (tractable) variational posterior $q_\theta(w)$, obtained by minimizing the KL divergence between the two.
The predictive distribution can then be approximated by replacing the true posterior with the variational posterior $q_\theta(w)$ obtained from this KL-divergence minimization.
The level of similarity between the two distributions is measured by the KL divergence:

$$\mathrm{KL}\big(q_\theta(w)\,\|\,p(w \mid D)\big) = \int q_\theta(w) \log \frac{q_\theta(w)}{p(w \mid D)}\, dw$$
Minimizing this KL divergence is equivalent to maximizing the ELBO (Evidence Lower BOund), which contains an integral with respect to the variational distribution $q_\theta(w)$:

$$\mathcal{L}_{\mathrm{VI}}(\theta) = \int q_\theta(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q_\theta(w)\,\|\,p(w)\big)$$
Maximizing the ELBO results in a variational distribution $q_\theta(w)$ that explains the data well (through the expected log-likelihood term) while staying close to the prior (through the KL term).
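As a small illustration of this trade-off, the NumPy sketch below estimates the expected log-likelihood term by Monte Carlo sampling from a mean-field Gaussian $q_\theta(w)$ over a single weight and adds the closed-form Gaussian KL term. The toy linear model, the standard-normal prior, and the noise level are assumptions made purely for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: y = 2x + noise; a single-weight "network" for illustration
X = rng.normal(size=50)
Y = 2.0 * X + 0.1 * rng.normal(size=50)

def log_likelihood(w, noise_std=0.1):
    """log p(Y | X, w) under a Gaussian observation model."""
    resid = Y - w * X
    return np.sum(-0.5 * (resid / noise_std) ** 2
                  - np.log(noise_std * np.sqrt(2 * np.pi)))

def kl_gaussian(mu_q, std_q, mu_p=0.0, std_p=1.0):
    """Closed-form KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) )."""
    return (np.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2) - 0.5)

def elbo_estimate(mu_q, std_q, n_samples=100):
    """Monte Carlo estimate of E_q[log p(Y|X,w)] - KL(q || p)."""
    w_samples = mu_q + std_q * rng.normal(size=n_samples)  # reparameterized draws
    expected_ll = np.mean([log_likelihood(w) for w in w_samples])
    return expected_ll - kl_gaussian(mu_q, std_q)

print(elbo_estimate(mu_q=2.0, std_q=0.05))  # q near the true weight -> high ELBO
print(elbo_estimate(mu_q=0.0, std_q=1.0))   # prior-like q           -> low ELBO
```

A $q_\theta(w)$ concentrated near the data-generating weight attains a much higher ELBO than one that ignores the data, which is exactly the behaviour the maximization exploits.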
The optimal variational distribution then gives the approximate predictive distribution

$$q^*_\theta(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, q^*_\theta(w)\, dw,$$

where $q^*_\theta(w)$ is the optimum of the ELBO. To sum up, for Bayesian Neural Networks there are many methods of approximating the posterior distribution $p(w \mid D)$, which can then be used to approximate the predictive distribution $p(y^* \mid x^*, D)$. Within variational inference, the required expectations can be estimated in several ways, for example Monte Carlo sampling, SGD, or the EM algorithm; MC dropout uses Monte Carlo sampling.
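For completeness, here is a minimal sketch of how MC dropout approximates the predictive distribution: dropout is kept active at test time and $T$ stochastic forward passes are averaged. The two-layer network, dropout rate, and random (untrained) weights are purely illustrative assumptions to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, W2, p_drop=0.5):
    """One stochastic forward pass of a 2-layer net with dropout kept on."""
    h = np.maximum(0.0, x @ W1)          # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop  # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)        # inverted-dropout scaling
    return h @ W2

def mc_dropout_predict(x, W1, W2, T=100):
    """Approximate p(y*|x*, D): mean and variance over T dropout samples."""
    preds = np.stack([dropout_forward(x, W1, W2) for _ in range(T)])
    return preds.mean(axis=0), preds.var(axis=0)  # predictive mean, uncertainty

# illustrative (untrained) weights just to show the sampling loop
x_star = rng.normal(size=(1, 4))
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))
mean, var = mc_dropout_predict(x_star, W1, W2)
```

Each dropout mask corresponds to one sample of the weights from the (implicit) variational posterior, so the sample mean and variance play the role of the marginalised predictive mean and its uncertainty.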