Jekyll2019-02-15T07:48:59+00:00https://blog.yani.io/feed.xmlA Shallow Blog about Deep LearningDeep Learning BlogYani Ioannouyani.ioannou@gmail.comhttps://yani.io/annou/Pushing the Exoplanet Frontier with Deep Learning2018-10-12T00:00:00+01:002018-10-12T00:00:00+01:00https://blog.yani.io/nasa-fdl-exoplanets<p><img style="float: right;" src="/assets/images/posts/2018-10-12-nasa-fdl-exoplanets/nasatour.jpg" />
This summer I was invited to take part in the 2018 <a href="https://frontierdevelopmentlab.org">NASA Frontier Development Lab</a>, along with a small team including <a href="http://michelesasdelli.com">Michele Sasdelli (University of Adelaide)</a>, and a pair of planetary scientists, <a href="https://astro.berkeley.edu/researcher-profile/3629984-megan-ansdell">Megan Ansdell (University of California at Berkeley)</a> and <a href="http://www.hughosborn.co.uk">Hugh Osborn (Laboratoire d’Astrophysique de Marseille)</a>.
Our team, composed of both machine learning and planetary scientists, was challenged over the course of 8 weeks to combine our expert knowledge in order to improve the methods behind one of the most exciting frontiers of science: <em>exoplanet discovery</em>.</p>
<p>Here I discuss some of the challenges of applying machine learning to real-world scientific data, in particular noisy and sparse periodic time-series data.</p>
<h1 id="exoplanets">Exoplanets</h1>
<p>Our knowledge of exoplanets, or planets that exist outside our Solar System, has advanced drastically over the last few decades. In fact, until relatively recently one could have called exoplanets a theoretical concept. The first confirmed detection of a real exoplanet wasn’t until 1992 <a href="/references/#ref-wolszczan1992planetary">(Wolszczan et al. 1992)</a>, and even by 2004 only about a hundred exoplanets had been detected. This all changed with the launch of the <a href="https://keplerscience.arc.nasa.gov">Kepler space telescope in 2009</a> <a href="/references/#ref-kepler">(Borucki et al. 2010)</a>. Since then, thousands of exoplanets have been detected with the “transit” method (i.e., detecting an exoplanet by observing the drop in brightness of a star as the orbiting exoplanet crosses our line-of-sight to the star). This year (2018) a new but related space telescope was launched: the <a href="https://tess.gsfc.nasa.gov">Transiting Exoplanet Survey Satellite (TESS)</a> <a href="/references/#ref-tess">(Ricker et al. 2014)</a>. TESS, which also uses the transit method, will concentrate on finding exoplanets closer to Earth and around brighter stars than Kepler allowed, which will be important for the follow-up observations needed to learn more details about these exoplanets, such as their compositions and atmospheres.</p>
<p>TESS is already collecting new data with the potential to improve our knowledge of exoplanets, but there is a bottleneck to accessing this knowledge: humans. Each candidate planet identified by TESS must be confirmed by a scientist with follow-up observations from the ground, making it vital that these planet candidates be as reliable as possible, meaning that false positives must be minimized. At the same time, we must avoid missing real planets and minimize the false negatives.</p>
<h2 id="the-keplertess-pipeline">The Kepler/TESS Pipeline</h2>
<p>The raw data returned by a space telescope such as Kepler or TESS is essentially a very noisy, low-framerate video of a patch of the sky. For both Kepler and TESS, the data is a set of <em>Target Pixel Files</em> (TPFs), which are small (e.g. <script type="math/tex">11 \times 11</script> pixels) image frames roughly centered on the target stars, collected at a regular time interval over a given amount of time (every 2 minutes over 27 days for TESS). From these TPFs, the pipeline extracts time-series photometry of the target stars, called a “light curve”, which is essentially the 1D signal of the brightness of a star over time. The pipeline must then remove the systematic and random noise introduced by the instrument, as well as real stellar phenomena that can look similar to noisy exoplanet transit signatures. Only then can the pipeline search for the exoplanet <em>transit signals</em>, the characteristic drop in light when an exoplanet passes in front of its star (see <a href="#fig-transit">fig. 1</a>). This is no easy task, and a team at the NASA Ames Research Center has spent the greater part of a decade creating a pipeline to do all this for the Kepler data <a href="/references/#ref-keplerhandbook">(Jenkins, et al. 2017)</a>, and is now also using it to process the new TESS data.</p>
<h2 id="challenges-of-the-keplertess-exoplanet-datasets">Challenges of the Kepler/TESS Exoplanet Datasets</h2>
<figure id="fig-transit">
<img src="/assets/images/posts/2018-10-12-nasa-fdl-exoplanets/views.png" alt="Global and local views of a folded exoplanet transit signal from a light curve" />
<figcaption>Figure 1: Global and local views of a folded exoplanet transit signal from a light curve.
Light curves are the pre-processed output of the Kepler/TESS pipeline, a time-series photometry of the target stars. The pipeline removes systematic and random noise present in the raw images. Even then, the characteristic transit signals are barely above the noise floor and are <em>folded</em>, i.e. averaged over their period, both to increase the signal and to represent signals with different planetary periods consistently.</figcaption>
</figure>
<p>The challenges present in the detection and classification of exoplanets are different from those seen in typical supervised machine learning problems. The quality and completeness of the “ground truth” is limited, being both noisy and intrinsically incomplete. Only the already discovered exoplanets are labelled, and a large number of undiscovered planets are present in the data, some of them incorrectly labelled as non-planets. Those planets that are labelled are biased towards the planets that are easier to detect, that is, large planets. The nature of the TESS mission also requires a quick “response time” in order to follow up the most promising planets with other instruments without wasting limited telescope time and resources.</p>
<p>Transit signals are periodic dips in the light curves, but these are often near the noise floor of the data. Without the pipeline’s removal of systematic noise, transit signals are rarely apparent at all. Transit signals are also extremely sparse: the occultation of the star by the planet lasts only a very small fraction of the period and, depending on the planet’s orbital period, might occur only a handful of times during the observation. What makes it possible to observe the transit signals at all is the high precision of the period of the dips, which allows the signals to be <em>folded</em>, i.e. averaged over the planetary period. The planetary period is unknown <em>a priori</em>, and must be searched for exhaustively. All of these characteristics of the signals make the problem challenging, and different from problems typically solved with machine learning methods.</p>
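<p>To make the idea of folding concrete, here is a minimal sketch in plain Python. It is not the Kepler/TESS pipeline: the synthetic light curve, the function name <code>fold_light_curve</code>, and every number (period, depth, noise level, cadence, bin count) are assumptions chosen for illustration. A single in-transit measurement is about as deep as the noise, but averaging all samples that share the same orbital phase makes the dip stand out.</p>

```python
import random

def fold_light_curve(times, flux, period, epoch, n_bins):
    """Average flux into phase bins; phase 0.5 is the transit centre."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, f in zip(times, flux):
        phase = ((t - epoch) / period + 0.5) % 1.0
        b = min(int(phase * n_bins), n_bins - 1)
        sums[b] += f
        counts[b] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]

# Synthetic light curve (all values hypothetical, not real TESS data)
random.seed(0)
period, epoch = 3.0, 1.0        # days
duration, depth = 0.1, 0.002    # transit length (days) and fractional depth
noise = 0.002                   # per-sample noise, as large as the depth
times = [i * 0.01 for i in range(2700)]   # 27 days of samples
flux = []
for t in times:
    phase = ((t - epoch) / period + 0.5) % 1.0
    in_transit = abs(phase - 0.5) * period < duration / 2
    flux.append(1.0 - (depth if in_transit else 0.0) + random.gauss(0, noise))

# An odd number of bins puts one bin exactly on the transit centre
folded = fold_light_curve(times, flux, period, epoch, n_bins=31)
deepest_bin = min(range(len(folded)), key=lambda i: folded[i])
print("deepest phase bin:", deepest_bin)
```

<p>Folding on a wrong period smears the dip across all phase bins, which is why the unknown period has to be searched over.</p>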
<h2 id="learning-approaches">Learning Approaches</h2>
<h3 id="lightcurves">Lightcurves</h3>
<p>Recently a deep learning approach to exoplanet classification from detection candidate light curves output by the Kepler pipeline was proposed <a href="/references/#ref-shallue2018">(Shallue, et al. 2018)</a>.
A CNN is trained on the light curves folded by the candidate period. Our main approach was to expand upon their work by incorporating more domain knowledge, and by extending the method to the TESS dataset, which is quite different from that of Kepler. We also addressed the severe class imbalance present in TESS (few candidates are labelled as planets compared to non-planets) by using mini-batch class balancing. Our approach improved in both performance and efficiency over that of <a href="/references/#ref-shallue2018">(Shallue, et al. 2018)</a>. We also tried folding the candidate light curves on an exhaustive range of possible periods, avoiding the need for the pipeline’s detection.</p>
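<p>Mini-batch class balancing can be done in several ways, and the details of our implementation are not given here. The sketch below shows one common variant under stated assumptions: each mini-batch draws the same number of samples from every class, sampling with replacement so the rare class can be reused. The function name <code>balanced_batches</code> and the 2% planet fraction are hypothetical.</p>

```python
import random

def balanced_batches(labels, batch_size, n_batches, rng):
    """Yield lists of sample indices with equal counts from each class.

    Indices are drawn with replacement, so a rare class can fill its
    share of the batch even when it has few examples.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    per_class = batch_size // len(by_class)
    for _ in range(n_batches):
        batch = []
        for indices in by_class.values():
            batch.extend(rng.choices(indices, k=per_class))
        rng.shuffle(batch)
        yield batch

labels = [1] * 20 + [0] * 980          # hypothetical: 2% planets
rng = random.Random(0)
batches = list(balanced_batches(labels, batch_size=32, n_batches=5, rng=rng))
planet_fraction = sum(labels[i] for i in batches[0]) / len(batches[0])
print("planet fraction in first batch:", planet_fraction)
```

<p>The trade-off is that rare-class samples are seen far more often than common-class samples, so the network's output probabilities no longer reflect the true class frequencies.</p>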
<h3 id="target-pixel-files">Target Pixel Files</h3>
<p>In a more ambitious on-going approach, we attempt to train on the raw TPF images, bypassing the Kepler/TESS pipeline.
Given the intrinsic systematic noise, the sparsity of the signal and its low signal-to-noise ratio, however, we have found this a very challenging dataset to train on. A typical TESS TPF time series has dimensions 11<script type="math/tex">\times</script>11<script type="math/tex">\times</script>19815, and there may be as few as two transits. It appears that novel machine learning approaches may be required to learn to classify exoplanets in the face of extremely sparse periodic signals.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In exoplanet detection, many of the challenges we encountered were typical of applying machine learning to any real-world problem, such as class imbalance and noisy labels. However, the problems we encountered in trying to learn from the raw TPF images highlighted a kind of real-world data that is not well addressed by current machine learning methods. In particular, learning sparse periodic signals with a low signal-to-noise ratio in the presence of strong systematic noise is challenging.</p>
<p>For more information, view our <a href="https://frontierdevelopmentlab.org/blog/2018/8/24/fdl-2018-exoplanets-team-presentation">recent presentation on the work we did over the summer</a>, or see the <a href="https://frontierdevelopmentlab.org/exoplanets">FDL exoplanets challenge website</a>.</p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>I’d like to thank the whole TESS team at SETI/NASA who were a massive help and proposed this challenge, in particular Jeffery Smith, Jon Jenkins and Douglas Caldwell. I’d also like to thank the NASA Frontier Development Lab organizers and mentors for giving us this unique opportunity. And finally I thank our challenge’s industry partners: Google Cloud and Kx Systems, in particular Google Cloud’s generous donation of compute resources without the support of which our work would not have been possible.</p>
Learning with Backpropagation2018-04-05T00:00:00+01:002018-04-05T00:00:00+01:00https://blog.yani.io/sgd<p>Learning with backpropagation is much like the delta rule; sensitivities are used to correct weights proportional to a constant <em>learning rate</em> or <em>step size</em> parameter $\gamma$. Although the correction is proportional to the sensitivity, we wish to <em>reduce</em> the error $E^n$, and so we move the weight in the opposite direction of the gradient
$\frac{\partial E^n}{\partial w_{ji}}$.</p>
<h2 id="gradient-descent">Gradient Descent</h2>
<p>Formally, the weight change rule is given by,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Delta w^n_{ji} &= -\gamma \frac{\partial E^n}{\partial w_{ji}}\\
&= -\gamma \, \delta^n_j y_i,\label{eqn:backproplearningrule}
\end{align} %]]></script>
<p>where $\delta^n_j$ is as defined in <a href="/backpropagation/#mjx-eqn-eqn%3Afulldeltadefinition">backprop: (13)</a>, and $y_i$ is the output of neuron $i$.</p>
<figure id="fig-learningrate">
<img src="/assets/images/posts/2018-04-05-sgd/learningratedecrease.svg" alt="Learning rate and convergence" />
<figcaption>Figure 1: An illustration of the effect of step size (learning rate) and learning policy on convergence with backpropagation. This example is of a symmetric 2D error surface, where the parameters are initialized to one of the symmetrically identical surface points $x_i$ where $i=0\ldots4$. For each of the different initial learning rates $\gamma_i$, the learning rate is decreased by $10\%$ each iteration.</figcaption>
</figure>
<p>Backpropagation is a method of steepest descent. This is illustrated in <a href="#fig-learningrate">fig. 1</a>, where the backpropagation learning rule, \eqref{eqn:backproplearningrule}, specifies a step size in the form of the <em>learning rate</em>. The learning rate parameter scales the step size, or the magnitude of the weight change vector. <a href="#fig-learningrate">Fig. 1</a> also illustrates the effect of learning rate on gradient descent. Too small a learning rate can result in very slow learning, such as for $\gamma_0$, while too large a step size can result in bouncing around the minimum ($\gamma_2, \gamma_3$), or missing it altogether.</p>
<p>In order to settle into a local minimum, the learning rate must also be decreased as training progresses. However, if the rate of decrease is too fast, the weights may never reach the basin of attraction of the local minimum, as with $\gamma_0$, while if the rate of decrease is too slow it will take a very long time to enter the basin of attraction, as with $\gamma_3$.</p>
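<p>The interaction between the initial learning rate and its decay can be seen on a toy problem. The sketch below is an illustration only, not the setup used for the figure: it assumes the 1D error surface $E(w) = \frac{1}{2}w^2$ (so the gradient is simply $w$), a 10% decay per iteration, and three hypothetical initial learning rates.</p>

```python
def descend(w0, lr0, decay, steps):
    """Gradient descent on E(w) = 0.5 * w**2 with a decaying learning rate."""
    w, lr = w0, lr0
    path = [w]
    for _ in range(steps):
        w = w - lr * w      # the gradient of 0.5 * w**2 is w
        lr *= decay         # multiply the learning rate by `decay` each step
        path.append(w)
    return path

paths = {lr0: descend(w0=1.0, lr0=lr0, decay=0.9, steps=100)
         for lr0 in (0.01, 0.5, 1.9)}
for lr0, path in paths.items():
    print(f"initial learning rate {lr0}: final w = {path[-1]:.6f}")
```

<p>With the smallest initial rate the steps shrink faster than the weight approaches the minimum at $w=0$, so it stalls close to where it started. The largest rate overshoots and flips sign on the first steps before the decay lets it settle.</p>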
<p>The balance of trying to find an appropriate learning rate and learning policy is unfortunately part of the “black magic” behind training DNNs which comes from experience, but <a href="/references/#ref-Bottou2012sgdtricks">Bottou et al., 2012</a>, <a href="/references/#ref-goodfellow2016deep">Goodfellow et al., 2016</a> are excellent references on some of the common approaches taken to make this task simpler.</p>
<h2 id="pathological">The Problem with First-Order Optimization</h2>
<p>The underlying reason learning rate and learning policy have such a large effect is that gradient descent is a <em>first-order</em> optimization method, and only considers the first-order partial derivatives, i.e. for a 2D error surface $E(x, y)$, gradient descent moves in the opposite direction of the gradient,</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla E(x, y) = {\left(\frac{\partial E}{\partial x}, \frac{\partial E}{\partial y}\right)}.
\end{equation}</script>
<p>This gradient tells us the direction of maximum increase at a given point on the error surface, but it does not tell us any information about the <em>curvature</em> of the surface at that point. The curvature of the surface is described by higher-order derivatives such as the second-order partial derivatives, e.g. $\frac{\partial^2 E}{\partial x^2}$, and mixed partial derivatives, e.g. $\frac{\partial^2 E}{\partial x\,\partial y}$.</p>
<figure id="fig-pathological">
<img style="width: 50%; display: inline-block" id="fig-narrowvalleysgd" src="/assets/images/posts/2018-04-05-sgd/narrowvalley.svg" alt="Gradient descent" />
<img style="width: 50%; display: inline-block" id="fig-narrowvalleymomentum" src="/assets/images/posts/2018-04-05-sgd/narrowvalleymomentum.svg" alt="Momentum" />
<figcaption>Figure 2: Pathological curvature. An error surface $E(x, y)$ exhibiting a narrow valley, and the optimal path from the starting point to the minimum shown by the red arrow. In a pathological error surface such as this, first-order methods cannot use the information provided by the Hessian on the surface curvature to avoid bouncing along the walls of the valley, slowing descent (as shown in the left image). Momentum alleviates this somewhat by damping the change in direction, preserving information on previous gradients and so allowing a quicker descent (as shown in the right image). Inspired by a similar diagram by <a href="/references/#ref-martens2010deep">Martens et al., 2010</a>.</figcaption>
</figure>
<p>These second-order partials give important information about the curvature of the error surface $E$. For example, in <a href="#fig-learningrate">fig. 1</a>, the error surface takes on an elliptical shape, which causes problems when we only consider the direction of maximum decrease $-\nabla E$. The classic example of such a pathological error surface for first-order methods is an error surface that looks like a narrow valley, as shown in <a href="#fig-narrowvalleysgd">fig. 2</a>. With an initialization outside the bottom of the valley, gradient descent will bounce along the walls of the valley, leading to a very slow learning convergence.</p>
<p>For well-behaved surfaces where the scaling of parameters is similar, basins of attraction around a minimum are roughly circular, and thus avoid this problem, since the first-order gradients will point almost directly at the minimum for any location on the error surface.</p>
<p>There are second-order optimization methods based on Newton’s method; however, they do not scale to the size of any practical DNN. The matrix of second-order partial derivatives of a scalar-valued function, the Hessian $\mathbf{H}$, is required for any full second-order optimization method, but the number of entries in the Hessian is the square of the number of parameters in the network. For networks with millions of parameters, this makes storing the Hessian infeasible.</p>
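<p>A quick back-of-the-envelope calculation shows the scale of the problem. The parameter count and the 4 bytes per value (32-bit floats) are assumptions picked for illustration, not figures for any particular network.</p>

```python
def hessian_bytes(n_params, bytes_per_value=4):
    """Memory needed to store a dense n_params x n_params Hessian."""
    return n_params ** 2 * bytes_per_value

n_params = 10_000_000                         # hypothetical 10M-parameter network
gradient_mb = n_params * 4 / 1e6              # one value per parameter
hessian_tb = hessian_bytes(n_params) / 1e12   # one value per pair of parameters
print(f"gradient: {gradient_mb:.0f} MB, Hessian: {hessian_tb:.0f} TB")
```

<p>The gradient fits comfortably in memory, while the dense Hessian is larger by a factor equal to the number of parameters.</p>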
<p>There are a whole slew of optimization tricks for gradient descent, often attempting to compensate for the shortcomings of first-order optimization without using the Hessian, or using some approximation to it. We will not cover those here, since none of these were used in our experiments. A full background of the issues of optimization in DNNs is outside the scope of this dissertation, however interested readers should refer to <a href="/references/#ref-goodfellow2016deep">Goodfellow et al., 2016</a> to learn more about these methods, and <a href="/references/#ref-martens2010deep">Martens et al., 2010</a> for an excellent introduction to the problems of first and second-order optimization in DNNs.</p>
<h2 id="momentum">Momentum</h2>
<p>A common improvement to gradient descent is momentum (<a href="/references/#ref-polyak1964somep">Polyak et al., 1964</a>, <a href="/references/#ref-rumelhartbackprop">Rumelhart et al., 1986</a>), a trick for minimizing the effect of pathological curvature on gradient descent, which also helps with variance in gradients. The name comes from the analogy of the update to the physical momentum of a moving particle, $\mathbf{p}=m\mathbf{v}$, where we assume unit mass, $m=1$.</p>
<p>In momentum, the gradients over multiple iterations are accumulated into a velocity gradient,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbf{v}_{t+1} &= \alpha \mathbf{v}_{t} - \gamma\nabla E(\mathbf{w}_t)\\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \mathbf{v}_{t+1},
\end{align} %]]></script>
<p>where $\alpha$ is the momentum coefficient, $\gamma$ is the learning rate, $t$ is the iteration, $\nabla E$ is the gradient of the error surface $E(\mathbf{w})$ being minimized, and $\mathbf{w}$ is the weight vector optimized. Momentum in effect stores some information on the gradients found in past iterations, and uses this to damp the effect of a new gradient on the search direction, as illustrated in <a href="#fig-narrowvalleymomentum">fig. 2</a>. For error surfaces with pathological curvatures, this can dramatically speed up learning.</p>
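<p>The effect is easy to reproduce on a toy narrow valley. The sketch below assumes the surface $E(x, y) = \frac{1}{2}(x^2 + 100y^2)$, a learning rate of 0.018 and a momentum coefficient of 0.9. These values are picked for illustration and are not taken from the figure. Each step scales the previous velocity, subtracts the scaled gradient, and then adds the new velocity to the weights.</p>

```python
def grad(w):
    """Gradient of the narrow valley E(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = w
    return (x, 100.0 * y)

def error(w):
    x, y = w
    return 0.5 * (x ** 2 + 100.0 * y ** 2)

def descend(w0, lr, alpha, steps):
    """Gradient descent with momentum; alpha = 0 gives plain gradient descent."""
    w, v = list(w0), [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        v = [alpha * vi - lr * gi for vi, gi in zip(v, g)]   # velocity update
        w = [wi + vi for wi, vi in zip(w, v)]                # weight update
    return w

plain = descend((1.0, 1.0), lr=0.018, alpha=0.0, steps=200)
with_momentum = descend((1.0, 1.0), lr=0.018, alpha=0.9, steps=200)
print("error without momentum:", error(plain))
print("error with momentum:   ", error(with_momentum))
```

<p>Without momentum the learning rate is capped by the steep $y$ direction, so progress along the shallow $x$ direction is slow. With momentum the accumulated velocity carries the weights along the valley floor much faster.</p>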
<h2 id="batch-and-stochastic-gradient-descent">Batch and Stochastic Gradient Descent</h2>
<p>Although the backpropagation weight change rule, \eqref{eqn:backproplearningrule}, tells us how to change the weights given a single training sample $x^n$, in practice this method is rarely used. The reason is simply that the gradients from a single sample are too noisy and are not representative of the dataset in general; $\Delta w^n_{ji}$ is only an approximation to the update given by the true gradient, since it comes from just one sample, $x^n$, of the training dataset $X$.</p>
<h3 id="batch-gradient-descent">Batch Gradient Descent</h3>
<p>At the opposite end of the spectrum there is <em>batch</em> training where the gradient is computed over all data samples in the training set,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Delta w_{ji} &= -\gamma\frac{1}{N} \sum^N_{n=1} \frac{\partial E^n}{\partial w_{ji}},
\label{eqn:batchlearningrule}
\end{align} %]]></script>
<p>where $N$ is the number of training samples in $X$. Batch training gives us the true gradient for the training set; however, it is also very expensive, since it requires a forward and backward pass of the network over all training samples for every update.</p>
<h3 id="stochastic-gradient-descent">Stochastic Gradient Descent</h3>
<p>Instead of computing the gradient on only one training sample, or over the entire training set, we might instead use a significant subset of the training set — a <em>mini-batch</em>. This approach is called <em>Stochastic Gradient Descent</em> (SGD).</p>
<p>When using SGD, we randomly sample (without replacement) a subset of the training set $X_{\textrm{mb}} \subset X$, such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Delta w_{ji} &= -\gamma \frac{1}{|X_{\textrm{mb}}|} \sum_{\{n|\,\mathbf{x}^n \in X_{\textrm{mb}}\}} \frac{\partial E^n}{\partial w_{ji}},\label{eqn:sgdrule}
\end{align} %]]></script>
<p>where the size of the mini-batch $|X_{\textrm{mb}}|$ should be significant enough to represent the statistics of the training set distribution, i.e. for a classification problem the mini-batch should capture a significant number of the classes in the training set. Using a mini-batch size of one, i.e. a single sample as shown in \eqref{eqn:backproplearningrule}, is a special case of SGD.</p>
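<p>As a small illustration of the mini-batch update, the sketch below fits a single weight to synthetic data. Everything in it is an assumption made for the example: the data set (targets $t = 3x$ plus noise), the per-sample error $E^n = \frac{1}{2}(w x^n - t^n)^2$, the learning rate and the batch size.</p>

```python
import random

rng = random.Random(0)
# Hypothetical training set: targets are 3 * x plus a little noise
X = [rng.uniform(-1, 1) for _ in range(1000)]
T = [3.0 * x + rng.gauss(0, 0.1) for x in X]

def grad_sample(w, n):
    """dE^n/dw for the per-sample error E^n = 0.5 * (w * x_n - t_n)**2."""
    return (w * X[n] - T[n]) * X[n]

def sgd(w, lr, batch_size, steps):
    for _ in range(steps):
        # Draw a mini-batch of sample indices without replacement
        batch = rng.sample(range(len(X)), batch_size)
        g = sum(grad_sample(w, n) for n in batch) / batch_size
        w -= lr * g
    return w

w = sgd(w=0.0, lr=0.5, batch_size=32, steps=200)
print("learned weight:", w)
```

<p>Setting <code>batch_size=len(X)</code> turns this into batch gradient descent, and <code>batch_size=1</code> gives the single-sample rule, so the three methods differ only in how many samples are averaged per update.</p>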
<p>It has been observed in practice that adding noise to the gradient by using stochastic gradient descent often helps generalization compared to batch gradient descent, perhaps by preventing overfitting. Note that even if we use the true gradient for the training dataset, the training set $X$ is only a sampling of the population distribution we want our network to generalize to.</p>
Backpropagation Derivation - Multi-layer Neural Networks2018-03-17T00:00:00+00:002018-03-17T00:00:00+00:00https://blog.yani.io/backpropagation<figure id="fig-neurononelayer">
<img src="/assets/images/posts/2018-03-16-deltarule/neurononelayer.svg" alt="Detailed illustration of a single-layer neural network" />
<figcaption>Figure 1. Detailed illustration of a single-layer neural network trainable with the delta rule. The input layer consists of a set of inputs, $\{ X_{0}, \ldots, X_{N} \}$. The layer has weights $\{w_{j0}, \ldots, w_{jN}\}$, bias $b_j$, net neuron activation $a_j = \sum_i w_{ji} X_i$, activation function $f$, and output $y_j$. The error for output $y_j$, $e_j$, is calculated using the target label $t_j$.</figcaption>
</figure>
<h1 id="the-limitations-of-single-layer-networks">The Limitations of Single-Layer Networks</h1>
<p>A single-layer neural network, such as the perceptron shown in <a href="#fig-neurononelayer">fig. 1</a>, is only a linear classifier, and as such is ineffective at learning a large variety of tasks. Most notably, in the 1969 book <a href="/references/#ref-minsky1988perceptrons"><em>Perceptrons</em></a>, the authors showed that single-layer perceptrons could not learn to model functions as simple as the XOR function, amongst other non-linearly separable classification problems.</p>
<figure id="perceptronxor">
<img src="/assets/images/posts/2018-03-17-backprop/perceptronxor.svg" alt="An illustration of the inability to correctly classify the XOR function" />
<figcaption>
Figure 2. An illustration of the inability of a single line (i.e. a perceptron) to correctly classify the XOR function. Instead, the composition of two lines is required to correctly separate these samples, i.e. multiple layers.
</figcaption>
</figure>
<p>As shown in <a href="#perceptronxor">fig. 2</a>, no single line can separate even a sparse sampling of the XOR function — i.e. <em>it is not linearly separable</em>. Instead, only a composition of lines is able to correctly separate and classify this function, and other non-linearly separable problems.</p>
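<p>This can be checked numerically. The sketch below assumes a hard-threshold unit and a coarse grid of candidate weights, both chosen for illustration. It searches for a single line that classifies XOR and then shows a hand-built composition of two lines that does. The hidden units and their weights are one possible choice, not the only one.</p>

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(a):
    return 1 if a > 0 else 0

def perceptron(w1, w2, b, x):
    """A single threshold unit: one line in the input plane."""
    return step(w1 * x[0] + w2 * x[1] + b)

# Search a coarse grid of weights and biases for one separating line
grid = [i / 2 for i in range(-8, 9)]          # -4.0, -3.5, ..., 4.0
solutions = [
    (w1, w2, b) for w1, w2, b in product(grid, repeat=3)
    if all(perceptron(w1, w2, b, x) == t for x, t in XOR.items())
]
print("single-layer solutions found:", len(solutions))

def two_layer(x):
    """Two hidden lines (OR and AND) combined by an output unit."""
    h_or = perceptron(1, 1, -0.5, x)
    h_and = perceptron(1, 1, -1.5, x)
    return step(h_or - h_and - 0.5)           # OR but not AND

print("two-layer outputs:", {x: two_layer(x) for x in XOR})
```

<p>The grid search is only a demonstration: the result holds for all real-valued weights, since the four XOR constraints on a single threshold unit contradict each other.</p>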
<p>At the time, it was not obvious how to train networks with more than one layer of neurons, since the methods of learning neuron weights, the <em>perceptron learning rule</em> <a href="/references/#ref-rosenblatt1961principles">(Rosenblatt et al., 1961)</a> for perceptrons or the <em>delta rule</em> <a href="/references/#ref-widrow1960adaptive">(Widrow et al., 1960)</a> for general neurons (<a href="/deltarule">as we derived in the previous post</a>), only applied to single-layered networks. This became known as the credit-assignment problem.</p>
<h1 id="backpropagation">Backpropagation</h1>
<p>The credit-assignment problem was solved with the discovery of <em>backpropagation</em> (also known as the <em>generalized delta rule</em>), allowing learning in multi-layer neural networks. It is somewhat controversial as to who first “discovered” backpropagation, since it is essentially the application of the chain rule to neural networks; however, it’s generally accepted that it was first demonstrated experimentally by <a href="/references/#ref-rumelhartbackprop">Rumelhart et al., 1986</a>. Although it is “just the chain rule”, to dismiss this first demonstration of backpropagation in neural networks is to understate the importance of this discovery to the field, and to dismiss the practical difficulties in first implementing the algorithm, a fact that will be attested to by anyone who has since attempted it.</p>
<p>The following is a derivation of backpropagation loosely based on the excellent references of <a href="/references/#ref-Bishop1995">Bishop (1995)</a> and <a href="/references/#ref-haykin1994neural">Haykin (1994)</a>, although with different notation. This derivation builds upon the derivation for the <a href="/deltarule">delta rule</a> in the previous section, although it is important to note that, as shown in <a href="#fig-neurontwolayer">fig. 3</a>, the indexing we will use to refer to neurons of different layers differs from that in <a href="#fig-neurononelayer">fig. 1</a> for the single-layer case.</p>
<p>We are interested in finding the sensitivity of the error $E$ to a given weight in the network $w_{ji}$. There are two classes of weights for which we must derive different rules:</p>
<ul id="enum-outputneuron">
<li><strong><em>Output Neurons:</em></strong> weights belonging to <em>output layer</em> neurons, i.e. neurons lying directly before the output, such as $w_{kj}$ in <a href="#fig-neurontwolayer">fig. 3</a>, and</li>
</ul>
<ul id="enum-hiddenneuron">
<li><strong><em>Hidden Neurons:</em></strong> weights belonging to <em>hidden layer</em> neurons, such as $w_{ji}$ in <a href="#fig-neurontwolayer">fig. 3</a>.</li>
</ul>
<h3 id="output-layer">Output Layer</h3>
<p>The output weights are relatively easy to find, since they correspond to the same types of weights found in single-layer networks, and have direct access to the error signal, i.e. $e^n_j$.</p>
<p id="enum-outputneuron">Indeed <a href="/deltarule">our derivation of the delta rule</a> also describes the sensitivity of the weights in the output layer of a multi-layer neural network. With some change of notation (now indexing by $k$ rather than $j$ to match <a href="#fig-neurontwolayer">fig. 3</a>), we can use the same <a href="/deltarule">delta rule</a> sensitivity,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{kj}} &= \frac{\partial E^n}{\partial a^n_k} \frac{\partial a^n_k}{\partial w_{kj}}\\
\frac{\partial E^n}{\partial w_{kj}} &= \frac{\partial E^n}{\partial e^n_k}\frac{\partial e^n_k}{\partial y^n_k} \frac{\partial y^n_k}{\partial a^n_k} \frac{\partial a^n_k}{\partial w_{kj}}\\
&= - e^n_k f'\left( a^n_k \right) x^n_{kj}\\
&= \delta^n_k x^n_{kj}\\
&= \delta^n_k y^n_j.\label{eqn:outputlayer}
\end{align} %]]></script>
<h3 id="hidden-layer">Hidden Layer</h3>
<figure id="fig-hiddenneuron">
<img src="/assets/images/posts/2018-03-17-backprop/neurontwolayer.svg" alt="Detailed illustration of a neural network with a single hidden layer" />
<figcaption>Figure 3. Detailed illustration of a neural network with a single hidden layer. The input layer consists of a set of inputs, $\{x_{0}, \ldots, x_{N}\}$. The hidden layer has weights $\{w_{j0}, \ldots, w_{jN}\}$, bias $b_j$, net neuron activity $a_j = \sum_i w_{ji} x_i$ and activation function $f$. The output layer, with output $y_k$, has weights $\{w_{k0}, \ldots, w_{kN}\}$ and bias $b_k$. The error for output $y_k$, $e_k$, is calculated using the target label $t_k$. Note that $x^n_{kj} \equiv y^n_j$.
</figcaption>
</figure>
<p>We will first derive the partial derivative $\frac{\partial E^n}{\partial w_{ji}}$, for a single hidden layer network, such as that illustrated in <a href="#fig-hiddenneuron">fig. 3</a>. Unlike in the case of a single layer network, as covered in the previous derivation of the <a href="/deltarule">delta rule</a>, the weights belonging to hidden neurons have no direct access to the error signal. Instead, we must calculate the error signal from all of the neurons that indirectly connect the neuron to the error (i.e. every output neuron $y_k$).</p>
<p>Following from the chain rule we can write the partial derivative of the error $E^n$ with respect to a hidden weight $w_{ji}$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= \underbrace{\left( \sum_k \frac{\partial E^n}{\partial e^n_{k}} \frac{\partial e^n_{k}}{\partial y^n_{k}} \frac{\partial y^n_{k}}{\partial a^n_k} \frac{\partial a^n_k}{\partial y^n_{j}}\right)}_\text{output neurons} \underbrace{\frac{\partial y^n_{j}}{\partial a^n_{j}} \frac{\partial a^n_{j}}{\partial w_{ji}}}_\text{hidden neuron},\label{eqn:twolayer1}
\end{align} %]]></script>
<p>where the sum arises from the fact that, unlike in <a href="/deltarule/#mjx-eqn-eqn%3Asumonetermpartial">delta rule: (10)</a> where the weight $w_{kj}$ affects only a single output, the hidden weight $w_{ji}$ affects all neurons in the subsequent layer (see <a href="#fig-neurontwolayer">fig. 3</a>).</p>
<p>We already know how to calculate the partials for the output layer from the derivation of the delta rule for single-layer networks. The first three factors inside the sum are exactly the output delta, $\delta^n_k = \frac{\partial E^n}{\partial a^n_k}$, used in \eqref{eqn:outputlayer}, so we can substitute it for the output neuron and error partial derivatives,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= \left( \sum_k \delta^n_k \frac{\partial a^n_k}{\partial y^n_{j}}\right) \frac{\partial y^n_{j}}{\partial a^n_{j}} \frac{\partial a^n_{j}}{\partial w_{ji}}.\label{eqn:twolayer2}
\end{align} %]]></script>
<p>Recall from <a href="/deltarule/#mjx-eqn-eqn%3Aweightsum">delta rule: (2)</a> that the net activation $a$ is a weighted sum of the outputs of the previous layer. Thus,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{\partial a^n_k}{\partial y^n_{j}} &= \frac{\partial}{\partial y^n_{j}} \left(\sum_{j'} w_{kj'} y^n_{j'} \right)\\
&= w_{kj},
\end{align*} %]]></script>
<p>and substituting from <a href="/deltarule/#mjx-eqn-eqn%3Apartialdyda">delta rule: (8)</a> and <a href="/deltarule/#mjx-eqn-eqn%3Asumonetermpartial">delta rule: (10)</a> into \eqref{eqn:twolayer2},</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= \left( \sum_k \delta^n_k w_{kj}\right) f'\left( a^n_j \right) x_i. \label{eqn:twolayer3}
\end{align} %]]></script>
<p>This bears some resemblance to the derived expression for a single-layer network, and just as in <a href="/deltarule/#mjx-eqn-eqn%3Adelta">delta rule: (14)</a>, we can use our definition of the delta to simplify it. For hidden layers this evaluates as
<script type="math/tex">% <![CDATA[
\begin{align}
\delta^n_j &\equiv \frac{\partial E^n}{\partial a^n_j}\\
&= \left( \sum_k \frac{\partial E^n}{\partial e^n_{k}} \frac{\partial e^n_{k}}{\partial y^n_{k}} \frac{\partial y^n_{k}}{\partial a^n_k} \frac{\partial a^n_k}{\partial y^n_{j}} \right) \frac{\partial y^n_{j}}{\partial a^n_j}\\
&= \left(\sum_k \delta^n_k w_{kj} \right) f'\left( a^n_j \right).\label{eqn:deltahidden}
\end{align} %]]></script></p>
<p>This leaves us with the more convenient expression (as we will see when deriving for an arbitrary number of hidden layers),
<script type="math/tex">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= \delta^n_j x_i.
\label{eqn:twolayer4}
\end{align} %]]></script></p>
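<p>As a sanity check on the hidden-layer result, here is a minimal sketch in plain Python. It computes $\partial E^n / \partial w_{ji}$ for one hidden weight from the hidden delta, $\delta^n_j = f'(a^n_j) \sum_k \delta^n_k w_{kj}$, and compares it against a central finite difference. The sigmoid activation, the network sizes and all numeric values are assumptions chosen only for illustration.</p>

```python
import math

def f(a):
    """Activation function; the logistic sigmoid is an assumed choice."""
    return 1.0 / (1.0 + math.exp(-a))

def f_prime(a):
    """Derivative of the sigmoid with respect to its net activation."""
    s = f(a)
    return s * (1.0 - s)

def forward(x, w_hidden, w_out):
    """Forward pass: inputs i -> hidden neurons j -> output neurons k."""
    a_hid = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    y_hid = [f(a) for a in a_hid]
    a_out = [sum(w * yj for w, yj in zip(row, y_hid)) for row in w_out]
    y_out = [f(a) for a in a_out]
    return a_hid, a_out, y_out

def sample_error(x, t, w_hidden, w_out):
    """E^n = 1/2 sum_k (y_k - t_k)^2."""
    _, _, y_out = forward(x, w_hidden, w_out)
    return 0.5 * sum((y - tk) ** 2 for y, tk in zip(y_out, t))

def hidden_sensitivity(x, t, w_hidden, w_out):
    """dE/dw_ji for every hidden weight, via the hidden-layer delta."""
    a_hid, a_out, y_out = forward(x, w_hidden, w_out)
    # Output deltas: delta_k = e_k f'(a_k).
    delta_out = [(y - tk) * f_prime(a) for y, tk, a in zip(y_out, t, a_out)]
    # Hidden deltas: delta_j = f'(a_j) sum_k delta_k w_kj.
    delta_hid = [
        f_prime(a_j) * sum(d_k * row[j] for d_k, row in zip(delta_out, w_out))
        for j, a_j in enumerate(a_hid)
    ]
    # Sensitivity: dE/dw_ji = delta_j x_i.
    return [[d_j * xi for xi in x] for d_j in delta_hid]

# Arbitrary toy values, chosen only for illustration.
x = [1.0, 0.5, -0.3]
t = [0.0, 1.0]
w_hidden = [[0.1, -0.2, 0.4], [0.3, 0.1, -0.5]]  # w_ji: 2 hidden neurons, 3 inputs
w_out = [[0.2, -0.4], [-0.3, 0.6]]               # w_kj: 2 outputs, 2 hidden neurons

analytic = hidden_sensitivity(x, t, w_hidden, w_out)[0][1]

# Central finite difference on the same weight, w_hidden[0][1].
eps = 1e-6
w_hidden[0][1] += eps
e_plus = sample_error(x, t, w_hidden, w_out)
w_hidden[0][1] -= 2 * eps
e_minus = sample_error(x, t, w_hidden, w_out)
w_hidden[0][1] += eps  # restore the original weight
numeric = (e_plus - e_minus) / (2 * eps)

print(analytic, numeric)
```

<p>The two printed numbers should agree to many decimal places. A check like this is a cheap way to catch index mistakes in a hand-derived gradient.</p>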
<h3 id="arbitraryhidden">Arbitrary Number of Hidden Layers</h3>
<p>The derivation above was based on the specific case of a single hidden layer network, but it is trivial to extend this result to multiple hidden layers. There is a recursion in the calculation of the partial derivatives in \eqref{eqn:deltahidden} which holds for a network with any number of hidden layers, and which we will now make explicit.</p>
<p>The delta is defined,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\delta^n_j = \begin{cases}
f'\left( a^n_j \right) e^n_j \,& \textrm{when neuron $j$ is output}\\
f'\left( a^n_j \right) \left( \sum_k \delta^n_k w_{kj} \right)& \textrm{when neuron $j$ is hidden},
\end{cases}\label{eqn:fulldeltadefinition}
\end{equation} %]]></script>
<p>where $k$ indexes the neurons in the layer following neuron $j$. This holds for any adjacent neural network layers $i, j, k$, including the output layer, where the outputs are considered to have an index $j$. The sensitivity is then,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= \delta^n_j y_i.
\label{eqn:sensitivity}
\end{align} %]]></script>
Backpropagation Derivation - Delta Rule
2018-03-16T00:00:00+00:00 https://blog.yani.io/deltarule
<p>I enjoyed writing my background; however, the bit I was really surprised to have enjoyed writing up is the derivation of back-propagation. I’ve read many books, articles and blogs that of course venture to do the same, but I didn’t find any of them particularly intuitive. The best I did find were probably those of <a href="/references/#ref-Bishop1995">Bishop (1995)</a> and <a href="/references/#ref-haykin1994neural">Haykin (1994)</a>, on which I based my derivation.</p>
<p>Below I include this derivation of back-propagation, starting with deriving the so-called ‘delta rule’, the update rule for a network with a single layer of weights, and expanding the derivation to multiple hidden layers, i.e. back-propagation.</p>
<h1 id="the-delta-rule-learning-with-a-single-hidden-layer">The Delta Rule: Learning with a Single Layer</h1>
<p>We start by describing how to learn with a single layer, a method known as the delta rule. The delta rule is a straightforward application of gradient descent (i.e. repeatedly stepping downhill on the error surface), and is easy to do because in a single-layer neural network the neurons have direct access to the error signal.</p>
<figure>
<img src="/assets/images/posts/2018-03-16-deltarule/neurononelayer.svg" alt="Detailed illustration of a single-layer neural network" />
<figcaption>Detailed illustration of a single-layer neural network trainable with the delta rule. The input layer consists of a set of inputs, $\{ X_{0}, \ldots, X_{N} \}$. The layer has weights $\{w_{j0}, \ldots, w_{jN}\}$, bias $b_j$, net neuron activation $a_j = \sum_i w_{ji} X_i$, activation function $f$, and output $y_j$. The error for output $y_j$, $e_j$, is calculated using the target label $t_j$.</figcaption>
</figure>
<p>The delta rule for single-layered neural networks is a gradient descent method, using the derivative of the output error with respect to the network’s weights to adjust the weights to better classify training examples.</p>
<p>Training is performed on a training dataset $X$, where each training sample $\mathbf{x}^n\in X$ is a vector $\mathbf{x}^n = (X^n_0, \ldots, X^n_N)$. Assume that for a given training sample $\mathbf{x}^n$, the $j$<sup>th</sup> neuron in our single-layer neural network has output $y^n_j$, target (desired) output $t^n_j$, and weights $w=(w_{j0}, \ldots, w_{jN})$, as shown in the above figure. We can consider the bias to be an extra weight with a unit input, and thus we can omit the explicit bias from the derivation.</p>
<p>We want to know how to change a given weight $w_{ji}$ given the output of node $j$ for a given input data sample $\mathbf{x}^n$,</p>
<script type="math/tex; mode=display">\begin{equation}
y^n_j = f\left( a^n_j \right),\label{eqn:output}
\end{equation}</script>
<p>where the net activation $a^n_j$ is,</p>
<script type="math/tex; mode=display">\begin{equation}
a^n_j = \sum_i w_{ji} X^n_{i}.\label{eqn:weightsum}
\end{equation}</script>
<p>To do so, we must use the error of our prediction for each output $y^n_j$ and training sample $\mathbf{x}^n$ as compared to the known label $t^n_j$,
<script type="math/tex">\begin{equation}
e^n_j = y^n_j - t^n_j.\label{eqn:error}
\end{equation}</script></p>
<p>For this derivation, we assume the error for a single sample is half the sum of the squared errors of each output, although the derivation holds as long as our error function is in the form of an average (<a href="/references/#ref-Bishop1995">Bishop, 1995</a>),</p>
<script type="math/tex; mode=display">\begin{equation}
E^n = \frac{1}{2} \sum_j {\left(e^n_j\right)}^2.\label{eqn:errorsum}
\end{equation}</script>
<p>The chain rule allows us to calculate the <em>sensitivity</em> of the error to each weight $w_{ji}$ in the network,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= \frac{\partial E^n}{\partial e^n_j}\frac{\partial e^n_j}{\partial y^n_j} \frac{\partial y^n_j}{\partial a^n_j} \frac{\partial a^n_j}{\partial w_{ji}}.
\end{align} %]]></script>
<p>Differentiating \eqref{eqn:errorsum} with respect to $e^n_j$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial e^n_j} &= e^n_j,
\end{align} %]]></script>
<p>\eqref{eqn:error} with respect to $y^n_j$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial e^n_j}{\partial y^n_j} &= 1,
\end{align} %]]></script>
<p>\eqref{eqn:output} with respect to $a^n_j$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial y^n_j}{\partial a^n_j} &= f'\left( a^n_j \right), \label{eqn:partialdyda}
\end{align} %]]></script>
<p>and finally \eqref{eqn:weightsum} with respect to $w_{ji}$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial a^n_j}{\partial w_{ji}} &= \frac{\partial}{\partial w_{ji}} \left( \sum_i w_{ji} X^n_{i} \right)\\
&= X^n_i,\label{eqn:sumonetermpartial}
\end{align} %]]></script>
<p>since only one of the terms in the sum is related to the specific weight $w_{ji}$.</p>
<p>Thus the sensitivity is,
<script type="math/tex">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= e^n_j f'\left( a^n_j \right) X^n_i. \label{eqn:deltasensitivity}
\end{align} %]]></script></p>
<p>Typically what is variously called the local gradient, error, or simply <em>delta</em>, is then defined,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\delta^n_j &\equiv \frac{\partial E^n}{\partial a^n_j} \\
&= \frac{\partial E^n}{\partial e^n_j}\frac{\partial e^n_j}{\partial y^n_j} \frac{\partial y^n_j}{\partial a^n_j} \\
&= e^n_j f'\left( a^n_j \right),\label{eqn:delta}
\end{align} %]]></script>
<p>such that \eqref{eqn:deltasensitivity} can be rewritten,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\partial E^n}{\partial w_{ji}} &= \delta^n_j X^n_i.
\end{align} %]]></script>
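<p>To make the sensitivity concrete, the following is a small sketch in plain Python for a single output neuron. It evaluates $\delta^n_j X^n_i$ for every weight and compares the result with a central finite difference of $E^n$. The sigmoid activation and the numeric values are assumptions made for the example, not part of the derivation.</p>

```python
import math

def sigmoid(a):
    """Assumed activation function f."""
    return 1.0 / (1.0 + math.exp(-a))

def sample_error(weights, inputs, target):
    """E^n = 1/2 (y - t)^2 for a single output neuron."""
    a = sum(w * x for w, x in zip(weights, inputs))
    return 0.5 * (sigmoid(a) - target) ** 2

def sensitivity(weights, inputs, target):
    """Return dE/dw_i = delta * X_i for every weight, with delta = e f'(a)."""
    a = sum(w * x for w, x in zip(weights, inputs))
    y = sigmoid(a)
    delta = (y - target) * y * (1.0 - y)  # e * f'(a) for the sigmoid
    return [delta * x for x in inputs]

# Illustrative values; the first input is the unit input that carries the bias.
weights = [0.5, -0.3, 0.8]
inputs = [1.0, 2.0, -1.0]
target = 1.0

analytic = sensitivity(weights, inputs, target)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = []
for i in range(len(weights)):
    up = list(weights)
    down = list(weights)
    up[i] += eps
    down[i] -= eps
    diff = sample_error(up, inputs, target) - sample_error(down, inputs, target)
    numeric.append(diff / (2 * eps))

for a_i, n_i in zip(analytic, numeric):
    print(a_i, n_i)
```

<p>Each printed pair should match closely, which is a quick way to confirm the chain-rule factors were multiplied out correctly.</p>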
<p>The delta rule adjusts each weight $w_{ji}$ proportional to the sensitivity,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Delta w_{ji} &= -\gamma \frac{\partial E^n}{\partial w_{ji}},
\end{align} %]]></script>
<p>where $\gamma$ is a constant called the <em>learning rate</em> or <em>step size</em>. Using the delta defined in \eqref{eqn:delta}, this is simply written,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Delta w_{ji} &= -\gamma \delta^n_j X^n_i.
\end{align} %]]></script>
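<p>Putting the update rule to work, here is a minimal sketch of delta-rule training for a single sigmoid neuron on the logical OR function, in plain Python. The choice of task, the learning rate $\gamma = 0.5$, the zero initialization and the number of epochs are all assumptions picked for illustration.</p>

```python
import math

def sigmoid(a):
    """Assumed activation function f."""
    return 1.0 / (1.0 + math.exp(-a))

def output(weights, xs):
    """y = f(a), with a the weighted sum of the inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, xs)))

# Logical OR; the leading 1.0 in each input is the unit bias input.
samples = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]

def total_error(weights):
    """Sum of the per-sample errors E^n over the training set."""
    return sum(0.5 * (output(weights, xs) - t) ** 2 for xs, t in samples)

def train(weights, gamma=0.5, epochs=5000):
    """Apply the delta rule after every sample (online gradient descent)."""
    weights = list(weights)
    for _ in range(epochs):
        for xs, t in samples:
            y = output(weights, xs)
            delta = (y - t) * y * (1.0 - y)  # e * f'(a) for the sigmoid
            # Delta rule: w_i <- w_i - gamma * delta * X_i
            weights = [w - gamma * delta * x for w, x in zip(weights, xs)]
    return weights

initial = [0.0, 0.0, 0.0]
learned = train(initial)
predictions = [1 if output(learned, xs) > 0.5 else 0 for xs, _ in samples]

print(total_error(initial), total_error(learned))
print(predictions)
```

<p>Because OR is linearly separable, a single neuron is enough here; the error should fall well below its starting value and the thresholded predictions should match the targets.</p>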
<p><a href="/backpropagation">Continue to Multi-layer Backpropagation Derivation</a></p>
Background Tutorials
2018-03-15T00:00:00+00:00 https://blog.yani.io/thesisbackground
<p>Recently I defended my <a href="https://yani.io/annou/thesis_online.pdf">PhD thesis</a>, and was told by my examiners it might be good to share some of the background I wrote (which was honestly a bit excessive!) as tutorials. From today I’m going to progressively start putting these on my blog, and will update this post with links to each of them.</p>
<ul>
<li><a href="/deltarule">Backpropagation Derivation - Delta Rule</a></li>
<li><a href="/backpropagation">Backpropagation Derivation - Multi-layer Neural Networks</a></li>
<li><a href="/sgd">Learning with Backpropagation</a></li>
</ul>
A Tutorial on Filter Groups (Grouped Convolution)
2017-08-10T00:00:00+01:00 https://blog.yani.io/filter-group-tutorial
<p>Filter groups (AKA grouped convolution) were introduced in the now seminal <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks">AlexNet paper</a> in 2012. As explained by the authors, their primary motivation was to allow the training of the network over two Nvidia GTX 580 GPUs with 1.5GB of memory each. With the model requiring just under 3GB of GPU RAM to train, filter groups allowed more efficient model parallelization across the GPUs, as shown in the illustration of the network from the paper:</p>
<figure>
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/alexnetarchitecture.svg" alt="AlexNet Architecture" />
<figcaption>The architecture of AlexNet as illustrated in the original paper, showing two separate convolutional filter groups across most of the layers (<a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks">Alex Krizhevsky et al. 2012</a>).</figcaption>
</figure>
<p>Until the initial publication of the <a href="https://arxiv.org/abs/1605.06489">Deep Roots</a> paper in May 2016, the vast majority of deep learning researchers had explained away filter groups as an engineering hack. Indeed, it was clear that this was the primary reason for their invention, and surely removing parameters would decrease accuracy?</p>
<h2 id="not-just-an-engineering-hack">Not just an Engineering Hack!</h2>
<figure>
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/alexnetfilters.png" alt="AlexNet Filters" />
<figcaption>AlexNet <tt>conv1</tt> filter separation: as noted by the authors, filter groups appear to structure learned filters into two distinct groups, black-and-white and colour filters (<a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks">Alex Krizhevsky et al. 2012</a>).</figcaption>
</figure>
<p>However, even the AlexNet authors noted, back in 2012, that there was an interesting side-effect to this engineering hack. With the <tt>conv1</tt> filters being easily interpreted, they observed that filter groups seemed to consistently divide <tt>conv1</tt> into two separate and distinct tasks: black-and-white filters and colour filters.</p>
<figure>
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/alexnetgroupgraph.svg" alt="AlexNet with Varying Numbers of Filter Groups" />
<figcaption>AlexNet trained with varying numbers of filter groups, from 1 (i.e. no filter groups), to 4. When trained with 2 filter groups, AlexNet is more efficient and yet achieves the same if not lower validation error.</figcaption>
</figure>
<p>What wasn’t noted explicitly in the AlexNet paper was the more important side-effect of convolutional groups: that they learn <strong>better representations</strong>. This seems like quite an extraordinary claim; however, it is backed up by one simple experiment: train AlexNet with and without filter groups and observe the difference in accuracy/computational efficiency. This is illustrated in the graph above, and as can be seen, not only is AlexNet without filter groups less efficient (both in parameters and compute), but it is also slightly less accurate!</p>
<h2 id="how-do-filter-groups-work">How do Filter Groups Work?</h2>
<figure>
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/convlayer.svg" alt="Normal Convolutional Layer" />
<figcaption>A normal convolutional layer. Yellow blocks represent learned parameters, gray blocks represent feature maps/input images (working memory).</figcaption>
</figure>
<p>Above is shown a normal convolutional layer, with no filter groups. Unlike most illustrations of CNNs, including that of AlexNet, here we explicitly show the <em>channel dimension</em>. This is the third dimension of a convolutional feature map, where the output of each filter is represented by one channel. In illustrations like this it is clear that the spatial dimension of a feature map is often the tip of the iceberg: as we get deeper in a CNN, the number of channels rapidly increases (with the increase in the number of filters), while the spatial dimensions decrease (with pooling/strided convolution). Thus in much of the network, the channel dimension will dominate.</p>
<figure>
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/filtergroups2.svg" alt="Convolutional Layer with Filter Groups" />
<figcaption>A convolutional layer with 2 filter groups. Note that each of the filters in the grouped convolutional layer now has exactly half the depth, i.e. half the parameters and half the compute of the original filter.</figcaption>
</figure>
<p>Above is illustrated a convolutional layer with 2 filter groups, where the filters in each filter group are convolved with only half of the previous layer’s feature maps. Unlike in the AlexNet illustration, with the third dimension shown it is immediately obvious that the grouped convolutional filters are much smaller than their normal counterparts. With two filter groups, as used in most of AlexNet, each filter has exactly half the number of parameters (yellow) of the equivalent filter in a normal convolutional layer.</p>
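<p>The saving is easy to verify with a little arithmetic. The sketch below, in plain Python, counts weights and multiply-accumulates for a convolutional layer with and without filter groups. The function name <tt>conv_layer_cost</tt> and the layer sizes are hypothetical, chosen only for illustration, and biases are ignored.</p>

```python
def conv_layer_cost(c_in, c_out, kernel, out_h, out_w, groups=1):
    """Return (parameters, multiply-accumulates) for a conv layer, ignoring biases.

    Each of the c_out filters only sees c_in / groups input channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    params = c_out * (c_in // groups) * kernel * kernel
    macs = params * out_h * out_w  # every weight is used once per output position
    return params, macs

# Illustrative layer: 96 input channels, 256 filters of 5x5, 27x27 output map.
standard = conv_layer_cost(96, 256, 5, 27, 27, groups=1)
grouped = conv_layer_cost(96, 256, 5, 27, 27, groups=2)

print("no groups:", standard)
print("2 groups: ", grouped)
print("parameter ratio:", grouped[0] / standard[0])
```

<p>With two groups both counts halve, and in general they scale as one over the number of groups, while the number of output channels stays the same.</p>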
<h2 id="why-do-filter-groups-work">Why do Filter Groups Work?</h2>
<p>This is where it gets a bit more complicated. It’s not immediately obvious that filter groups should be of any benefit, but they are often able to learn more efficient and better representations. This is because <strong>filter relationships are sparse</strong>.</p>
<figure>
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/cifar-nin-4pad-conv8-corr.png" alt="No Filter Groups" />
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/colorbar.svg" alt="Colour Bar" />
<figcaption>The correlation matrix between filters of adjacent layers in a Network-in-Network model trained on CIFAR10. Pairs of highly correlated filters are brighter, while lower correlated filters are darker.</figcaption>
</figure>
<p>We can show this by looking at the correlation across filters of adjacent layers. As shown above, the correlations are generally quite low. In a standard network, however, there is no discernible ordering of these filter relationships, and they also differ between models trained with different random initializations. What about with filter groups?</p>
<figure>
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/cifar-nin-groupanimation.gif" alt="No Filter Groups" />
<img src="/assets/images/posts/2017-08-10-filter-group-tutorial/colorbar.svg" alt="Colour Bar" />
<figcaption>The correlations between filters of adjacent layers in a Network-in-Network model trained on CIFAR10, when trained with 1, 2, 4, 8 and 16 filter groups.</figcaption>
</figure>
<p>The effect of filter groups is to learn with a <em>block-diagonal</em> structured sparsity on the channel dimension. As can be seen in the correlation images, the filters with high correlation are learned in a more structured way in the networks with filter groups. In effect, filter relationships that don’t have to be learned are no longer parameterized. In reducing the number of parameters in the network in this salient way, it is not as easy to over-fit, and hence a regularization-like effect allows the optimizer to learn more accurate, more efficient deep networks.</p>
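<p>The block-diagonal structure can be drawn directly. The following sketch, in plain Python, builds the channel connectivity pattern that filter groups impose: entry (o, i) is 1 when output channel o has weights on input channel i. The function name <tt>connectivity_mask</tt> and the 8-channel, 4-group sizes are assumptions for illustration only.</p>

```python
def connectivity_mask(c_in, c_out, groups):
    """mask[o][i] is 1 if output channel o is connected to input channel i."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    return [
        [1 if o // out_per_group == i // in_per_group else 0 for i in range(c_in)]
        for o in range(c_out)
    ]

mask = connectivity_mask(c_in=8, c_out=8, groups=4)
for row in mask:
    print("".join("#" if v else "." for v in row))

# Fraction of channel-to-channel connections that are still parameterized.
kept = sum(sum(row) for row in mask) / (8 * 8)
print("fraction kept:", kept)
```

<p>Only the blocks on the diagonal survive, so with 4 groups a quarter of the channel-to-channel connections are parameterized and the rest are fixed at zero.</p>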
<h2 id="unanswered-questions">Unanswered Questions</h2>
<p>How do we decide the number of filter groups to use? Can filter groups overlap? Do all groups have to be the same size, or can filter groups be heterogeneous?</p>
<p>Unfortunately, for the moment these questions are yet to be fully answered, although the latter has recently received some attention <a href="https://arxiv.org/abs/1707.09855">from Tae Lee et al., at KAIST</a>.</p>
CuDNN Now Accelerates Filter Groups!
2017-08-09T00:00:00+01:00 https://blog.yani.io/cudnn-filter-groups
<p>With the latest cuDNN 7 release, a request of mine to the cuDNN team from just over a year ago has finally come to fruition: filter groups are now properly handled by the popular framework which provides accelerated code for common deep learning operations on Nvidia GPUs, according to the release notes:</p>
<blockquote>
<p>Grouped Convolutions for models such as ResNeXt and Xception and CTC (Connectionist Temporal Classification) loss layer for temporal classification.</p>
</blockquote>
<p>For more information, see the <a href="https://developer.nvidia.com/cudnn">cuDNN release notes</a>. Thanks to <a href="https://twitter.com/mfigurnov">Michael Figurnov</a> for pointing this out to me!</p>
A Brief Review of Computer Vision and Pattern Recognition (CVPR) 2017
2017-07-27T00:00:00+01:00 https://blog.yani.io/cvpr-2017
<p><em>Mean paper name: “Deep … in the wild”</em></p>
<p>With what must be one of the best locations for a computer vision conference ever, CVPR 2017 was always going to be one of the best, but even considering that, this conference was notably better organized than most of its predecessors. This was despite a massive increase in attendance, over 37% more than CVPR 2016 in Las Vegas (<a href="http://cvpr2017.thecvf.com/files/CVPR2017_opening_ceremony.pdf">see here for the slides describing this year’s stats</a>), and a 40% increase in paper submissions. Notable changes this year included moving to a 3-track conference instead of 2 tracks. Even so, one of the days had a free afternoon, and the organizers noted that this was at least partially because reviewers had not nominated enough orals this year to fill the schedule.</p>
<h2 id="orals">Orals</h2>
<p>Of the few papers that were orals, I was impressed with the following:</p>
<ul>
<li>“Unsupervised Learning of Depth and Ego-Motion from Video”: This work is simply impressive (full disclosure, one of the co-authors is my former PhD supervisor). It learns depth simply from unlabelled data of cars driving around urban scenes. The weaknesses of the method seemed to be that it learned a bias in the data not to expect any close depth in the center of the frame (caused by correctly cautious drivers leaving a gap between themselves and the car in front), and a lack of good performance on non-urban scenes. In both cases the authors claimed that these were solvable by having more data. However, I believe there simply aren’t enough features outside of urban canyons for this method to succeed outside of urban environments.</li>
<li>“Learning From Simulated and Unsupervised Images”: Although the work is interesting, and potentially very useful, this oral stood out for another reason. It was explicitly an oral by Apple researchers, marking a significant shift for a company that only a few years ago wouldn’t even let its CVPR-attending employees admit their affiliation, never mind publish its research. Hopefully this newfound openness continues.</li>
<li>“Densely Connected Convolutional Networks”: This work presented some very impressive results in a domain I’m particularly interested in. I’m surprised such a network can be more computationally efficient while reducing error significantly. I will be trying to repeat these results for sure.</li>
<li>“Global Optimality in Neural Network Training”: With little theoretical progress in understanding the optimization of deep networks, I like this work because it has relatively few assumptions compared to many theoretical analyses, and yet has relatively large claims. Whether it can be put to effect in practice, however, remains to be seen.</li>
<li>“YOLO9000: Better, Faster, Stronger”: This oral surprisingly succeeded despite (because of?) the ridiculous title and the theme of the presentation (Daft Punk’s “Harder, Better, Faster, Stronger”). The presentation was engaging, and the results very impressive. A live demo of their realtime object detection system not only worked, but seemed to go beyond even the authors’ expectations when the object detection system identified most of the audience in view along with the planned foreground objects.</li>
</ul>
<h2 id="industrial-presence">Industrial Presence</h2>
<p>As with every year, the industrial presence at CVPR grew again this year. While the number of companies increased, this was mostly due to the large number of new startups with a booth presence, rather than the larger companies. If anything, the larger US companies had a slightly reduced presence overall, especially when I compare with NIPS. The notable exceptions were Apple, which organized <strong>three</strong> separate events (two ‘mixers’ and one technical session), and Nvidia, whose CEO attended the conference and made a major new GPU announcement. Notably, this year there were many more Asian companies with a presence, both Chinese (Tencent, and too many others to list) and Korean (Naver, Samsung).</p>
No Free Lunch
2017-07-19T00:00:00+01:00 https://blog.yani.io/no-free-lunch
<p>A fundamental topic, and yet one often left until later to learn, is that of the so-called “No Free Lunch” theorem. If I were asked to summarize the lesson of this theorem in one line, it would simply be:</p>
<p>Machine learning is <strong>not</strong> magic</p>
<p>And this is a very important lesson indeed. Perhaps the best intuition behind the theorem is to be gained from the following simple example. Assume we see a sequence of numbers,</p>
<script type="math/tex; mode=display">x = \{ 1, 3, 9, \ldots \}</script>
<p>and we are asked to predict the next number in the sequence. Most of us would probably predict,
<script type="math/tex">x = \{ 1, 3, 9, 27, \ldots \}</script>,
assuming that the sequence at each time step is being generated by <script type="math/tex">x_t = 3 x_{t-1}</script>.</p>
<p>However, there is no reason not to believe instead that this sequence is simply the output of a random number generator, in which case the next item of the sequence cannot easily be predicted. Even if it seems very unlikely (and Occam’s razor is a good argument here), we cannot completely disprove this hypothesis without seeing <strong>all</strong> the data points.</p>
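<p>The same point can be made without a random number generator. In the short Python sketch below, two hypotheses both reproduce the three observed values exactly, yet disagree about the fourth. The quadratic is an alternative I picked purely for illustration; any number of other rules would fit the data equally well.</p>

```python
observed = [1, 3, 9]

def geometric(t):
    """Hypothesis A: x_t = 3 * x_{t-1}, starting from x_1 = 1."""
    return 3 ** (t - 1)

def quadratic(t):
    """Hypothesis B: x_t = 2 t^2 - 4 t + 3, an alternative chosen for illustration."""
    return 2 * t * t - 4 * t + 3

results = {}
for name, hypothesis in [("geometric", geometric), ("quadratic", quadratic)]:
    fits_data = [hypothesis(t) for t in (1, 2, 3)] == observed
    results[name] = (fits_data, hypothesis(4))
    print(name, "fits the data:", fits_data, "predicts next:", hypothesis(4))
```

<p>Both hypotheses are consistent with everything we have seen, so the data alone cannot tell us which prediction to trust. Preferring one over the other is an assumption.</p>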
<blockquote>
<p>“Everything but the data is an assumption”</p>
<p><cite><a href="http://mlg.eng.cam.ac.uk/zoubin/">Zoubin Ghahramani</a>, MSR AI Summer School 2017</cite></p>
</blockquote>
<p>The only reason machine learning works at all is the assumptions we make about the problem. We call these assumptions the <em>model</em>. Whatever assumptions we make in our model will, of course, only help predictions with the types of problems where those assumptions hold, while hindering prediction with other types of problems. Wolpert et al. show that this means that at best, <strong>over all possible input data distributions</strong>, we cannot expect any model to do better than random.</p>
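<p>A toy version of this result can be checked by brute force. The Python sketch below is a simplified illustration, not the theorem itself: it enumerates every possible binary labelling of all 3-bit inputs, trains two simple learners on six of the inputs, and measures accuracy on the two held-out inputs. The learners, the train/test split and all the names are assumptions made for the example.</p>

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))     # all 8 possible inputs
train_x, test_x = inputs[:6], inputs[6:]     # 6 seen inputs, 2 held out

def majority_learner(train):
    """Predict the most common training label for every input."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones >= len(train) else 0
    return lambda x: guess

def nearest_neighbour_learner(train):
    """Predict the label of the closest training input (Hamming distance)."""
    def predict(x):
        closest = min(train, key=lambda pair: sum(a != b for a, b in zip(pair[0], x)))
        return closest[1]
    return predict

def off_training_accuracy(learner):
    """Held-out accuracy averaged uniformly over every possible target function."""
    correct = total = 0
    for labels in product([0, 1], repeat=len(inputs)):  # 2^8 target functions
        target = dict(zip(inputs, labels))
        model = learner([(x, target[x]) for x in train_x])
        for x in test_x:
            correct += model(x) == target[x]
            total += 1
    return correct / total

accuracies = {
    "majority": off_training_accuracy(majority_learner),
    "nearest neighbour": off_training_accuracy(nearest_neighbour_learner),
}
for name, accuracy in accuracies.items():
    print(name, accuracy)
```

<p>Both learners come out at exactly chance on the held-out inputs. For any fixed set of training labels, the held-out labels take every combination equally often, so whatever a learner predicts is right exactly half the time.</p>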
<p>This is often cited as the nail in the coffin for the idea of a universal learning algorithm, an algorithm that can learn any problem. Theoretically this is certainly true; however, we are interested only in learning real-world problems, for which the data lies in a specific subset of all possible distributions.</p>