Batch normalization (also known as batch norm) is a method used to make training of artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.[1]
The reasons behind the effectiveness of batch normalization remain under discussion. It was believed that it can mitigate the problem of internal covariate shift, where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network.[1] Recently, some scholars have argued that batch normalization does not reduce internal covariate shift, but rather smooths the objective function, which in turn improves the performance.[2] However, at initialization, batch normalization in fact induces severe gradient explosion in deep networks, which is only alleviated by skip connections in residual networks.[3] Others maintain that batch normalization achieves length-direction decoupling, and thereby accelerates neural networks.[4]
Internal covariate shift
Each layer of a neural network has inputs with a corresponding distribution, which is affected during the training process by the randomness in the parameter initialization and the randomness in the input data. The effect of these sources of randomness on the distribution of the inputs to internal layers during training is described as internal covariate shift. Although a precise definition seems to be missing, the phenomenon observed in experiments is a change in the means and variances of the inputs to internal layers during training.
Batch normalization was initially proposed to mitigate internal covariate shift.[1] During the training stage of networks, as the parameters of the preceding layers change, the distribution of inputs to the current layer changes accordingly, such that the current layer needs to constantly readjust to new distributions. This problem is especially severe for deep networks, because small changes in shallower hidden layers will be amplified as they propagate within the network, resulting in significant shift in deeper hidden layers. Therefore, the method of batch normalization is proposed to reduce these unwanted shifts to speed up training and to produce more reliable models.
Besides reducing internal covariate shift, batch normalization is believed to introduce many other benefits. With this additional operation, the network can use a higher learning rate without vanishing or exploding gradients. Furthermore, batch normalization seems to have a regularizing effect that improves the network's generalization, so that dropout may no longer be needed to mitigate overfitting. It has also been observed that a network with batch normalization becomes more robust to different initialization schemes and learning rates.
Procedures
Transformation
In a neural network, batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set, but to use this step jointly with stochastic optimization methods, it is impractical to use the global information. Thus, normalization is restrained to each mini-batch in the training process.
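Let $B$ denote a mini-batch of size $m$ drawn from the entire training set. The empirical mean and variance of $B$ can then be written as
$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$.
For a layer of the network with $d$-dimensional input, $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension of its input is then normalized (i.e. re-centered and re-scaled) separately,
$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{(\sigma_B^{(k)})^2 + \epsilon}}$, where $k \in [1, d]$ and $i \in [1, m]$; $\mu_B^{(k)}$ and $\sigma_B^{(k)}$ are the per-dimension mean and standard deviation, respectively.
$\epsilon$ is added in the denominator for numerical stability and is an arbitrarily small constant. The resulting normalized activations $\hat{x}^{(k)}$ have zero mean and unit variance if $\epsilon$ is not taken into account. To restore the representation power of the network, a transformation step then follows as
$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)}$,
where the parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are subsequently learned in the optimization process.
Formally, the operation that implements batch normalization is a transform $BN_{\gamma^{(k)},\beta^{(k)}}: x_{1 \ldots m}^{(k)} \rightarrow y_{1 \ldots m}^{(k)}$ called the Batch Normalizing transform. The output of the BN transform, $y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}(x^{(k)})$, is then passed to other network layers, while the normalized output $\hat{x}_i^{(k)}$ remains internal to the current layer.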
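The training-time transform can be sketched in a few lines of NumPy. This is a minimal illustration and not a reference implementation: the function name `batch_norm_forward`, the default `eps=1e-5`, and the toy data are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Apply the BN transform to a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                    # per-dimension mini-batch mean
    var = x.var(axis=0)                    # biased variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    y = gamma * x_hat + beta               # learned scale and shift
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # toy batch: m = 64, d = 4
y, x_hat, mu, var = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(x_hat.mean(axis=0), x_hat.var(axis=0))
```

Each column of `x_hat` has mean close to zero and variance slightly below one, the small gap coming from `eps` in the denominator.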
Backpropagation
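The described BN transform is a differentiable operation, and the gradient of the loss $l$ with respect to the different parameters can be computed directly with the chain rule.
Specifically, $\frac{\partial l}{\partial y_i^{(k)}}$ depends on the choice of activation function, and the gradients with respect to the other quantities can be expressed as functions of $\frac{\partial l}{\partial y_i^{(k)}}$:
$\frac{\partial l}{\partial \hat{x}_i^{(k)}} = \frac{\partial l}{\partial y_i^{(k)}} \gamma^{(k)}$,
$\frac{\partial l}{\partial \gamma^{(k)}} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i^{(k)}} \hat{x}_i^{(k)}$, $\frac{\partial l}{\partial \beta^{(k)}} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i^{(k)}}$, $\frac{\partial l}{\partial (\sigma_B^{(k)})^2} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i^{(k)}} (x_i^{(k)} - \mu_B^{(k)}) \left(-\frac{\gamma^{(k)}}{2}\right) \left((\sigma_B^{(k)})^2 + \epsilon\right)^{-3/2}$, $\frac{\partial l}{\partial \mu_B^{(k)}} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i^{(k)}} \frac{-\gamma^{(k)}}{\sqrt{(\sigma_B^{(k)})^2 + \epsilon}} + \frac{\partial l}{\partial (\sigma_B^{(k)})^2} \frac{1}{m} \sum_{i=1}^{m} (-2)(x_i^{(k)} - \mu_B^{(k)})$,
and $\frac{\partial l}{\partial x_i^{(k)}} = \frac{\partial l}{\partial \hat{x}_i^{(k)}} \frac{1}{\sqrt{(\sigma_B^{(k)})^2 + \epsilon}} + \frac{\partial l}{\partial (\sigma_B^{(k)})^2} \frac{2 (x_i^{(k)} - \mu_B^{(k)})}{m} + \frac{\partial l}{\partial \mu_B^{(k)}} \frac{1}{m}$.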
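The chain-rule gradients of the BN transform can be checked numerically. The sketch below assumes the standard formulas from the original paper, a toy loss `l = sum(c * y)` with fixed random coefficients `c`, and hypothetical helper names `bn_forward` and `bn_backward`; it compares the analytic gradient with respect to the input against central finite differences.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, x_hat, mu, var, gamma, eps)

def bn_backward(dy, cache):
    """Chain-rule gradients of the BN transform; dy = dl/dy has shape (m, d)."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dx_hat = dy * gamma
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dvar = (dx_hat * (x - mu)).sum(axis=0) * -0.5 * (var + eps) ** -1.5
    dmu = -dx_hat.sum(axis=0) / np.sqrt(var + eps) + dvar * (-2.0 * (x - mu)).mean(axis=0)
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
c = rng.normal(size=(8, 3))            # toy loss l = sum(c * y), so dl/dy = c

y, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(c, cache)

# Central finite differences for dl/dx, one input entry at a time.
num_dx = np.zeros_like(x)
h = 1e-5
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    lp = (bn_forward(xp, gamma, beta)[0] * c).sum()
    lm = (bn_forward(xm, gamma, beta)[0] * c).sum()
    num_dx[idx] = (lp - lm) / (2 * h)

max_err = np.abs(dx - num_dx).max()
print(max_err)
```

Because every input in the batch influences the batch mean and variance, the gradient with respect to one input depends on all the others, which is why the `dvar` and `dmu` terms appear.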
Inference
During the training stage, the normalization steps depend on the mini-batches to ensure efficient and reliable training. In the inference stage, however, this dependence is no longer useful. Instead, the normalization step in this stage is computed with the population statistics, so that the output depends on the input in a deterministic manner. The population mean, $E[x^{(k)}]$, and variance, $\operatorname{Var}[x^{(k)}]$, are computed as:
$E[x^{(k)}] = E_B[\mu_B^{(k)}]$, and $\operatorname{Var}[x^{(k)}] = \frac{m}{m-1} E_B[(\sigma_B^{(k)})^2]$.
The population statistics are thus a complete representation of the mini-batches.
The BN transform in the inference step thus becomes
$y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}^{\text{inf}}(x^{(k)}) = \gamma^{(k)} \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}] + \epsilon}} + \beta^{(k)}$,
where $y^{(k)}$ is passed on to future layers instead of $x^{(k)}$. Since the parameters are fixed in this transformation, the batch normalization procedure is essentially applying a linear transform to the activation.
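The inference-time behaviour can be illustrated as follows. This is a sketch under stated assumptions: the population statistics are averaged over equally sized mini-batches with the $m/(m-1)$ correction, the helper names `population_stats` and `bn_inference` are hypothetical, and the data are synthetic. Practical implementations often track exponential moving averages of the batch statistics instead.

```python
import numpy as np

def population_stats(batches):
    """Estimate E[x] and Var[x] from equally sized mini-batches of shape (m, d)."""
    m = batches[0].shape[0]
    mean = np.mean([b.mean(axis=0) for b in batches], axis=0)
    var = m / (m - 1) * np.mean([b.var(axis=0) for b in batches], axis=0)
    return mean, var

def bn_inference(x, gamma, beta, mean, var, eps=1e-5):
    # With fixed statistics the transform collapses to y = a * x + b.
    a = gamma / np.sqrt(var + eps)
    b = beta - a * mean
    return a * x + b

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=3.0, size=(50, 32, 4))  # 50 mini-batches, m = 32, d = 4
mean, var = population_stats(list(data))
gamma, beta = np.full(4, 2.0), np.full(4, 0.5)

x_new = rng.normal(loc=1.0, scale=3.0, size=(5, 4))
y = bn_inference(x_new, gamma, beta, mean, var)
direct = gamma * (x_new - mean) / np.sqrt(var + 1e-5) + beta
```

Writing the transform as `a * x + b` makes the point about linearity concrete: at inference time the layer is a fixed per-dimension scale and offset, which can be folded into an adjacent linear layer.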
Theory
Although batch normalization has become popular due to its strong empirical performance, the working mechanism of the method is not yet well-understood. The explanation made in the original paper[1] was that batch norm works by reducing internal covariate shift, but this has been challenged by more recent work. One experiment[2] trained a VGG-16 network[5] under 3 different training regimes: standard (no batch norm), batch norm, and batch norm with noise added to each layer during training. In the third model, the noise has non-zero mean and non-unit variance, i.e. it explicitly introduces covariate shift. Despite this, it showed similar accuracy to the second model, and both performed better than the first, suggesting that covariate shift is not the reason that batch norm improves performance.
Using batch normalization causes the items in a batch to no longer be iid, which can lead to difficulties in training due to lower quality gradient estimation.[6]
Smoothness
One alternative explanation[2] is that the improvement with batch normalization is instead due to it producing a smoother parameter space and smoother gradients, as formalized by a smaller Lipschitz constant.
Consider two identical networks, one of which contains batch normalization layers while the other does not, and compare their behavior. Denote the loss functions as $\hat{L}$ and $L$, respectively. Let the input to both networks be $x$, and the output be $y$, for which $y = Wx$, where $W$ is the layer weights. For the network with batch normalization, $y$ additionally goes through a batch normalization layer. Denote the normalized activation as $\hat{y}$, which has zero mean and unit variance. Let the transformed activation be $z = \gamma \hat{y} + \beta$, and suppose $\gamma$ and $\beta$ are constants. Finally, denote the standard deviation over a mini-batch $\hat{y}_j \in \mathbb{R}^m$ as $\sigma_j$.
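First, it can be shown that the gradient magnitude of a batch normalized network, $\|\nabla_{y_j} \hat{L}\|$, is bounded, with the bound expressed as
$\|\nabla_{y_j} \hat{L}\|^2 \le \frac{\gamma^2}{\sigma_j^2} \left( \|\nabla_{y_j} L\|^2 - \frac{1}{m} \langle 1, \nabla_{y_j} L \rangle^2 - \frac{1}{m} \langle \nabla_{y_j} L, \hat{y}_j \rangle^2 \right)$.
Since the gradient magnitude represents the Lipschitzness of the loss, this relationship indicates that a batch normalized network can have a smaller Lipschitz constant, that is, a smoother loss. Notice that the bound gets tighter when the gradient $\nabla_{y_j} L$ correlates with the activation $\hat{y}_j$, which is a common phenomenon. The scaling of $\frac{\gamma^2}{\sigma_j^2}$ is also significant, since the variance is often large.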
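Secondly, the quadratic form of the loss Hessian with respect to activation in the gradient direction can be bounded as
$(\nabla_{y_j} \hat{L})^T \frac{\partial^2 \hat{L}}{\partial y_j \partial y_j} (\nabla_{y_j} \hat{L}) \le \frac{\gamma^2}{\sigma_j^2} \left( \frac{\partial \hat{L}}{\partial y_j} \right)^T \left( \frac{\partial^2 L}{\partial y_j \partial y_j} \right) \left( \frac{\partial \hat{L}}{\partial y_j} \right) - \frac{\gamma}{m \sigma_j^2} \langle \nabla_{y_j} L, \hat{y}_j \rangle \left\| \frac{\partial \hat{L}}{\partial y_j} \right\|^2$.
The scaling of $\frac{\gamma^2}{\sigma_j^2}$ indicates that the loss Hessian is resilient to the mini-batch variance, whereas the second term on the right hand side suggests that it becomes smoother when the Hessian and the inner product are non-negative. If the loss is locally convex, then the Hessian is positive semi-definite, while the inner product is positive if $\hat{g}_j$ is in the direction towards the minimum of the loss. It could thus be concluded from this inequality that the gradient generally becomes more predictive with the batch normalization layer.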
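The bounds on the loss with respect to the normalized activation can then be translated into a bound on the loss with respect to the network weights:
$\hat{g}_j \le \frac{\gamma^2}{\sigma_j^2} \left( g_j^2 - m \mu_{g_j}^2 - \lambda^2 \langle \nabla_{y_j} L, \hat{y}_j \rangle^2 \right)$, where $g_j = \max_{\|X\| \le \lambda} \|\nabla_W L\|^2$ and $\hat{g}_j = \max_{\|X\| \le \lambda} \|\nabla_W \hat{L}\|^2$.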
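In addition to the smoother landscape, it is further shown that batch normalization could result in a better initialization with the following inequality:
$\|W_0 - \hat{W}^*\|^2 \le \|W_0 - W^*\|^2 - \frac{1}{\|W^*\|^2} \left( \|W^*\|^2 - \langle W^*, W_0 \rangle \right)^2$, where $W^*$ and $\hat{W}^*$ are the local optimal weights for the two networks, respectively.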
Some scholars argue that the above analysis cannot fully capture the performance of batch normalization, because the proof only concerns the largest eigenvalue, or equivalently, one direction in the landscape at all points. It is suggested that the complete eigenspectrum needs to be taken into account to make a conclusive analysis.[4]
Since it is hypothesized that batch normalization layers could reduce internal covariate shift, an experiment[2] is set up to measure quantitatively how much covariate shift is reduced. First, the notion of internal covariate shift needs to be defined mathematically. Specifically, to quantify the adjustment that a layer's parameters make in response to updates in previous layers, the correlation between the gradients of the loss before and after all previous layers are updated is measured, since gradients could capture the shifts from the first-order training method. If the shift introduced by the changes in previous layers is small, then the correlation between the gradients would be close to 1.
The correlation between the gradients is computed for four models: a standard VGG network,[5] a VGG network with batch normalization layers, a 25-layer deep linear network (DLN) trained with full-batch gradient descent, and a DLN network with batch normalization layers. Interestingly, it is shown that the standard VGG and DLN models both have higher correlations of gradients than their batch-normalized counterparts, indicating that the additional batch normalization layers are not reducing internal covariate shift.
Vanishing/exploding gradients
Even though batchnorm was originally introduced to alleviate gradient vanishing or explosion problems, a deep batchnorm network in fact suffers from gradient explosion at initialization time, no matter what it uses for nonlinearity. Thus the optimization landscape is very far from smooth for a randomly initialized, deep batchnorm network.
More precisely, if the network has $L$ layers, then the gradient of the first layer weights has norm $> c \lambda^L$ for some $\lambda > 1, c > 0$ depending only on the nonlinearity.
For any fixed nonlinearity, $\lambda$ decreases as the batch size increases. For example, for ReLU, $\lambda$ decreases to $\pi / (\pi - 1) \approx 1.467$ as the batch size tends to infinity.
Practically, this means deep batchnorm networks are untrainable.
This is only relieved by skip connections in the fashion of residual networks.[3]
This gradient explosion on the surface contradicts the smoothness property explained in the previous section, but in fact they are consistent. The previous section studies the effect of inserting a single batchnorm in a network, while the gradient explosion depends on stacking batchnorms typical of modern deep neural networks.
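To get a feel for how quickly an exponential lower bound of the form $c\lambda^L$ grows with depth $L$, the short calculation below evaluates $\lambda^L$ for a few depths. It assumes $\lambda = \pi/(\pi-1) \approx 1.467$, the large-batch limit for ReLU reported in [3], and leaves out the constant $c$, whose value is not given here; the depths are arbitrary.

```python
import math

lam = math.pi / (math.pi - 1)   # assumed: large-batch ReLU limit of lambda from [3]

for depth in (10, 50, 100):
    # Growth factor lambda**depth in the lower bound c * lambda**depth
    print(depth, f"{lam ** depth:.3e}")

growth_50 = lam ** 50
```

Even in this most favourable large-batch limit, the growth factor at 50 layers is already of order $10^8$.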
Decoupling
Another possible reason for the success of batch normalization is that it decouples the length and direction of the weight vectors and thus facilitates better training.
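By interpreting batch norm as a reparametrization of weight space, it can be shown that the length and the direction of the weights are separated and can thus be trained separately. For a particular neural network unit with input $x$ and weight vector $w$, denote its output as $f(w) = E_x[\phi(x^T w)]$, where $\phi$ is the activation function, and denote $S = E[x x^T]$. Assume that $E[x] = 0$, and that the spectrum of the matrix $S$ is bounded as $0 < \mu = \lambda_{\min}(S)$, $L = \lambda_{\max}(S) < \infty$, such that $S$ is symmetric positive definite. Adding batch normalization to this unit thus results in
$f_{BN}(w, \gamma, \beta) = E_x[\phi(BN(x^T w))] = E_x\left[\phi\left(\gamma \frac{x^T w - E_x[x^T w]}{\operatorname{var}_x[x^T w]^{1/2}} + \beta\right)\right]$, by definition.
The variance term can be simplified such that $\operatorname{var}_x[x^T w] = w^T S w$. Assume that $x$ has zero mean and $\beta$ can be omitted, then it follows that
$f_{BN}(w, \gamma) = E_x\left[\phi\left(\gamma \frac{x^T w}{(w^T S w)^{1/2}}\right)\right]$, where $(w^T S w)^{1/2}$ is the induced norm of $S$, $\|w\|_S$.
Hence, it could be concluded that $f_{BN}(w, \gamma) = E_x[\phi(x^T \tilde{w})]$, where $\tilde{w} = \gamma \frac{w}{\|w\|_S}$, and $\gamma$ and $w$ account for its length and direction separately. This property could then be used to prove the faster convergence of problems with batch normalization.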
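The length-direction decoupling can be verified numerically on synthetic data. The sketch below assumes zero-mean inputs, drops $\beta$ and $\epsilon$ as in the derivation above, and uses the empirical second-moment matrix in place of $S = E[xx^T]$; the helper names `s_norm` and `bn_preactivation` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
A = rng.normal(size=(d, d))
x = rng.normal(size=(n, d)) @ A.T      # correlated inputs
x -= x.mean(axis=0)                    # enforce an exactly zero empirical mean
S = x.T @ x / n                        # empirical second-moment matrix E[x x^T]

w = rng.normal(size=d)
gamma = 1.7

def s_norm(v):
    return np.sqrt(v @ S @ v)          # ||v||_S = (v^T S v)^(1/2)

def bn_preactivation(w, gamma):
    z = x @ w
    return gamma * (z - z.mean()) / z.std()   # batch-normalized x^T w, eps omitted

w_tilde = gamma * w / s_norm(w)        # reparametrized weight vector

same_variance = np.allclose(np.var(x @ w), w @ S @ w)
same_output = np.allclose(bn_preactivation(w, gamma), x @ w_tilde)
scale_invariant = np.allclose(bn_preactivation(10.0 * w, gamma), bn_preactivation(w, gamma))
print(same_variance, same_output, scale_invariant)
```

The last check shows the practical consequence: rescaling $w$ leaves the normalized pre-activation unchanged, so only the direction of $w$ matters and the length is carried entirely by $\gamma$.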
Linear convergence
Least-square problem
With the reparametrization interpretation, it could then be proved that applying batch normalization to the ordinary least squares problem achieves a linear convergence rate in gradient descent, which is faster than the regular gradient descent with only sub-linear convergence.
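Denote the objective of minimizing an ordinary least squares problem as
$\min_{\tilde{w} \in \mathbb{R}^d} f_{OLS}(\tilde{w}) = \min_{\tilde{w} \in \mathbb{R}^d} E_{x,y}[(y - x^T \tilde{w})^2]$, which, after dropping the constant term $E[y^2]$, is equivalent to $\min_{\tilde{w} \in \mathbb{R}^d} (2 u^T \tilde{w} + \tilde{w}^T S \tilde{w})$, where $u = E[-yx]$ and $S = E[x x^T]$.
Since $\tilde{w} = \gamma \frac{w}{\|w\|_S}$, the objective thus becomes
$\min_{w \in \mathbb{R}^d \setminus \{0\}, \gamma \in \mathbb{R}} f_{OLS}(w, \gamma) = \min_{w \in \mathbb{R}^d \setminus \{0\}, \gamma \in \mathbb{R}} \left( 2 \gamma \frac{u^T w}{\|w\|_S} + \gamma^2 \right)$, where 0 is excluded to avoid 0 in the denominator.
Since the objective is convex with respect to $\gamma$, its optimal value could be calculated by setting the partial derivative of the objective against $\gamma$ to 0, which gives $\gamma^* = -\frac{u^T w}{\|w\|_S}$. The objective could be further simplified to be
$\min_{w \in \mathbb{R}^d \setminus \{0\}} \rho(w) = \min_{w \in \mathbb{R}^d \setminus \{0\}} \left( -\frac{w^T u u^T w}{w^T S w} \right)$.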
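Note that this objective is a form of the generalized Rayleigh quotient
$\tilde{\rho}(w) = \frac{w^T B w}{w^T A w}$, where $B \in \mathbb{R}^{d \times d}$ is a symmetric matrix and $A \in \mathbb{R}^{d \times d}$ is a symmetric positive definite matrix.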
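It is proven that the gradient descent convergence rate of the generalized Rayleigh quotient is
$\frac{\lambda_1 - \rho(w_{t+1})}{\rho(w_{t+1}) - \lambda_2} \le \left( 1 - \frac{\lambda_1 - \lambda_2}{\lambda_1 - \lambda_{\min}} \right)^{2t} \frac{\lambda_1 - \rho(w_t)}{\rho(w_t) - \lambda_2}$, where $\lambda_1$ is the largest eigenvalue of $B$, $\lambda_2$ is the second largest eigenvalue of $B$, and $\lambda_{\min}$ is the smallest eigenvalue of $B$.[7]
In our case, $B = -u u^T$ is a rank one matrix (with $A = S$), and the convergence result can be simplified accordingly. Specifically, consider gradient descent steps of the form $w_{t+1} = w_t - \eta_t \nabla \rho(w_t)$ with step size $\eta_t = \frac{w_t^T S w_t}{2 L |\rho(w_t)|}$, and starting from $\rho(w_0) \ne 0$, then
$\rho(w_t) - \rho(w^*) \le \left( 1 - \frac{\mu}{L} \right)^{2t} (\rho(w_0) - \rho(w^*))$.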
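The gradient descent scheme on the Rayleigh-quotient objective can be tried on a small synthetic problem. The sketch below assumes $\rho(w) = -\frac{(u^T w)^2}{w^T S w}$ with step size $\eta_t = \frac{w_t^T S w_t}{2 L |\rho(w_t)|}$, where $L$ and $\mu$ are the largest and smallest eigenvalues of $S$. The matrix $S$, the vector $u$, and the starting point are random stand-ins rather than quantities from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal basis
eigs = np.linspace(1.0, 5.0, d)                # spectrum of S, so mu = 1 and L = 5
S = Q @ np.diag(eigs) @ Q.T
u = rng.normal(size=d)
mu, L = eigs[0], eigs[-1]

def rho(w):
    return -(u @ w) ** 2 / (w @ S @ w)

def grad_rho(w):
    # Gradient of the quotient: -2 * ((u^T w) u + rho(w) S w) / (w^T S w)
    return -2.0 * ((u @ w) * u + rho(w) * (S @ w)) / (w @ S @ w)

w = rng.normal(size=d)
w_star = np.linalg.solve(S, u)                 # minimizing direction S^{-1} u
rho_star = -u @ w_star                         # minimum value -u^T S^{-1} u
gap0 = rho(w) - rho_star

gaps = []
for t in range(200):
    eta = (w @ S @ w) / (2.0 * L * abs(rho(w)))
    w = w - eta * grad_rho(w)
    gaps.append(rho(w) - rho_star)

cosine = abs(w @ w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))
print(gap0, gaps[:5], cosine)
```

The optimality gap shrinks geometrically and the iterate aligns with $S^{-1}u$, the direction of the ordinary least squares solution; only the direction is meaningful here, since $\rho$ is invariant to rescaling $w$.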
Learning halfspace problem
The problem of learning halfspaces refers to the training of the Perceptron, which is the simplest form of neural network. The optimization problem in this case is
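$\min_{\tilde{w} \in \mathbb{R}^d} f_{LH}(\tilde{w}) = E_{y,x}[\phi(z^T \tilde{w})]$, where $z = -yx$ and $\phi$ is an arbitrary loss function.
Suppose that $\phi$ is infinitely differentiable and has a bounded derivative. Assume that the objective function $f_{LH}$ is $\zeta$-smooth, and that a solution $\alpha^* = \arg\min_{\alpha} \|\nabla f(\alpha w)\|^2$ exists and is bounded such that $-\infty < \alpha^* < \infty$. Also assume $z$ is a multivariate normal random variable. With the Gaussian assumption, it can be shown that all critical points lie on the same line, for any choice of loss function $\phi$. Specifically, the gradient of $f_{LH}$ could be represented as
$\nabla_{\tilde{w}} f_{LH}(\tilde{w}) = c_1(\tilde{w}) u + c_2(\tilde{w}) S \tilde{w}$, where $c_1(\tilde{w}) = E_z[\phi^{(1)}(z^T \tilde{w})] - E_z[\phi^{(2)}(z^T \tilde{w})](u^T \tilde{w})$, $c_2(\tilde{w}) = E_z[\phi^{(2)}(z^T \tilde{w})]$, and $\phi^{(i)}$ is the $i$-th derivative of $\phi$.
By setting the gradient to 0, it thus follows that the bounded critical points $\tilde{w}_*$ can be expressed as $\tilde{w}_* = g_* S^{-1} u$, where $g_*$ depends on $\tilde{w}_*$ and $\phi$. Combining this global property with length-direction decoupling, it could thus be proved that this optimization problem converges linearly.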
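The form of the gradient under the Gaussian assumption can be derived with Stein's lemma. The sketch below assumes $z \sim \mathcal{N}(u, \Sigma)$ with $u = E[z]$ and $S = E[z z^T]$, so that $\Sigma = S - u u^T$. Stein's lemma states that $E[g(z)(z - u)] = \Sigma\, E[\nabla g(z)]$ for a differentiable $g$, applied here with $g(z) = \phi^{(1)}(z^T \tilde{w})$.

```latex
\begin{aligned}
\nabla_{\tilde{w}} f_{LH}(\tilde{w})
  &= E_z\!\left[\phi^{(1)}(z^T\tilde{w})\, z\right] \\
  &= E_z\!\left[\phi^{(1)}(z^T\tilde{w})\right] u
     + \Sigma\, E_z\!\left[\nabla_z\, \phi^{(1)}(z^T\tilde{w})\right]
     && \text{(Stein's lemma)} \\
  &= E_z\!\left[\phi^{(1)}(z^T\tilde{w})\right] u
     + E_z\!\left[\phi^{(2)}(z^T\tilde{w})\right] (S - u u^T)\,\tilde{w} \\
  &= \underbrace{\left(E_z[\phi^{(1)}(z^T\tilde{w})]
     - E_z[\phi^{(2)}(z^T\tilde{w})]\,(u^T\tilde{w})\right)}_{c_1(\tilde{w})}\, u
     \;+\; \underbrace{E_z[\phi^{(2)}(z^T\tilde{w})]}_{c_2(\tilde{w})}\, S\tilde{w}.
\end{aligned}
```

Setting this expression to zero with $c_2(\tilde{w}) \ne 0$ gives $\tilde{w} = -\frac{c_1(\tilde{w})}{c_2(\tilde{w})} S^{-1} u$, so every bounded critical point is a scalar multiple of $S^{-1} u$.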
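First, a variation of gradient descent with batch normalization, Gradient Descent in Normalized Parameterization (GDNP), is designed for the objective function $\min_{w \in \mathbb{R}^d \setminus \{0\}, \gamma \in \mathbb{R}} f_{LH}(w, \gamma)$, such that the direction and length of the weights are updated separately. Denote the stopping criterion of GDNP as
$h(w_t, \gamma_t) = E_z[\phi'(z^T \tilde{w}_t)](u^T w_t) - E_z[\phi''(z^T \tilde{w}_t)](u^T w_t)^2$.
Let the step size be
$s_t = s(w_t, \gamma_t) = -\frac{\|w_t\|_S^3}{L \gamma_t h(w_t, \gamma_t)}$.
For each step, if $h(w_t, \gamma_t) \ne 0$, then update the direction as
$w_{t+1} = w_t - s_t \nabla_w f(w_t, \gamma_t)$.
Then update the length according to
$\gamma_t = \text{Bisection}(T_s, f, w_t)$, where $\text{Bisection}()$ is the classical bisection algorithm, and $T_s$ is the total number of iterations run in the bisection step.
Denote the total number of iterations as $T_d$, then the final output of GDNP is
$\tilde{w}_{T_d} = \gamma_{T_d} \frac{w_{T_d}}{\|w_{T_d}\|_S}$.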
The GDNP algorithm thus slightly modifies the batch normalization step for the ease of mathematical analysis.
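The length update can be illustrated in isolation. The sketch below covers only the bisection step for the length $\gamma$ at a fixed direction $w$, not the full GDNP algorithm. It assumes a softplus loss $\phi(t) = \log(1 + e^t)$, synthetic Gaussian samples standing in for $z = -yx$, and an initial bracket $[-20, 20]$; all of these are choices made for the example.

```python
import numpy as np

def bisection(dfunc, a, b, num_iters):
    """Classical bisection on [a, b]; assumes dfunc(a) < 0 < dfunc(b)."""
    for _ in range(num_iters):
        mid = 0.5 * (a + b)
        if dfunc(mid) > 0:
            b = mid
        else:
            a = mid
    return a, b

rng = np.random.default_rng(0)
z = rng.normal(loc=0.3, scale=1.0, size=(5000, 3))   # hypothetical samples of z = -y x
S = z.T @ z / len(z)                                 # empirical E[z z^T]
w = rng.normal(size=3)                               # fixed direction
s = z @ w / np.sqrt(w @ S @ w)                       # z^T w / ||w||_S

def dgamma(gamma):
    # d/dgamma of mean(softplus(gamma * s)) = mean(s * sigmoid(gamma * s))
    return np.mean(s / (1.0 + np.exp(-gamma * s)))

a0, b0, T_s = -20.0, 20.0, 40
a, b = bisection(dgamma, a0, b0, T_s)
print(a, b, dgamma(a))
```

Each bisection iteration halves the bracket, so after $T_s$ iterations its width is $2^{-T_s}|b_0 - a_0|$, which is the origin of the $2^{-T_s}$ factor in bounds on the derivative with respect to the length.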
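It can be shown that in GDNP, the partial derivative of $f_{LH}$ against the length component converges to zero at a linear rate, such that
$(\partial_\gamma f_{LH}(w_t, a_t^{(T_s)}))^2 \le \frac{2^{-T_s} \zeta |b_t^{(0)} - a_t^{(0)}|}{\mu^2}$, where $a_t^{(0)}$ and $b_t^{(0)}$ are the two starting points of the bisection algorithm on the left and on the right, correspondingly.
Further, for each iteration, the norm of the gradient of $f_{LH}$ with respect to $w$ converges linearly, such that
$\|w_t\|_S^2 \|\nabla f_{LH}(w_t, \gamma_t)\|_{S^{-1}}^2 \le \left( 1 - \frac{\mu}{L} \right)^{2t} \Phi^2 \gamma_t^2 (\rho(w_0) - \rho^*)$.
Combining these two inequalities, a bound could thus be obtained for the gradient with respect to $\tilde{w}_{T_d}$:
$\|\nabla_{\tilde{w}} f(\tilde{w}_{T_d})\|^2 \le \left( 1 - \frac{\mu}{L} \right)^{2 T_d} \Phi^2 (\rho(w_0) - \rho^*) + \frac{2^{-T_s} \zeta |b_t^{(0)} - a_t^{(0)}|}{\mu^2}$, such that the algorithm is guaranteed to converge linearly.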
Although the proof stands on the assumption of Gaussian input, it is also shown in experiments that GDNP could accelerate optimization without this constraint.
Neural networks
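Consider a multilayer perceptron (MLP) with one hidden layer and $m$ hidden units with mapping from input $x \in \mathbb{R}^d$ to a scalar output described as
$F_x(\tilde{W}, \Theta) = \sum_{i=1}^{m} \theta_i \phi(x^T \tilde{w}^{(i)})$, where $\tilde{w}^{(i)}$ and $\theta_i$ are the input and output weights of unit $i$ correspondingly, and $\phi$ is the activation function and is assumed to be a tanh function.
The input and output weights could then be optimized with
$\min_{\tilde{W}, \Theta} \left( f_{NN}(\tilde{W}, \Theta) = E_{y,x}[l(-y F_x(\tilde{W}, \Theta))] \right)$, where $l$ is a loss function, $\tilde{W} = \{\tilde{w}^{(1)}, \ldots, \tilde{w}^{(m)}\}$, and $\Theta = \{\theta^{(1)}, \ldots, \theta^{(m)}\}$.
Consider fixed $\Theta$ and optimizing only $\tilde{W}$. It can be shown that the critical points of $f_{NN}(\tilde{W})$ for a particular hidden unit $i$, $\hat{w}^{(i)}$, all align along one line depending on incoming information into the hidden layer, such that
$\hat{w}^{(i)} = \hat{c}^{(i)} S^{-1} u$, where $\hat{c}^{(i)} \in \mathbb{R}$ is a scalar, $i = 1, \ldots, m$.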
This result could be proved by setting the gradient of to zero and solving the system of equations.
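Apply the GDNP algorithm to this optimization problem by alternating optimization over the different hidden units. Specifically, for each hidden unit, run GDNP to find the optimal $W$ and $\gamma$. With the same choice of stopping criterion and stepsize, it follows that
$\|\nabla_{\tilde{w}^{(i)}} f(\tilde{w}_t^{(i)})\|_{S^{-1}}^2 \le \left( 1 - \frac{\mu}{L} \right)^{2t} C (\rho(w_0) - \rho^*) + \frac{2^{-T_s^{(i)}} \zeta |b_t^{(0)} - a_t^{(0)}|}{\mu^2}$.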
Since the parameters of each hidden unit converge linearly, the whole optimization problem has a linear rate of convergence.[4]
References
[1] Ioffe, Sergey; Szegedy, Christian (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ICML'15: Proceedings of the 32nd International Conference on Machine Learning, Volume 37, July 2015, pages 448–456. arXiv:1502.03167 [cs.LG].
[2] Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (29 May 2018). "How Does Batch Normalization Help Optimization?". arXiv:1805.11604 [stat.ML].
[3] Yang, Greg; Pennington, Jeffrey; Rao, Vinay; Sohl-Dickstein, Jascha; Schoenholz, Samuel S. (2019). "A Mean Field Theory of Batch Normalization". arXiv:1902.08129 [cs.NE].
[4] Kohler, Jonas; Daneshmand, Hadi; Lucchi, Aurelien; Zhou, Ming; Neymeyr, Klaus; Hofmann, Thomas (27 May 2018). "Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization". arXiv:1805.10694 [stat.ML].
[5] Simonyan, Karen; Zisserman, Andrew (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv:1409.1556 [cs.CV].