A Demo to Understand Automatic Differentiation

[Interactive demo appears here: panels for Training Progress (round, loss, step, graph, norm of gradient), Add Data (click to add points), Model (input \( (x, y) \), output \( (r, g, b) \), add layers, bias), and Training Options (learning rate, rounds, normalize, clip gradient, optimizer).]
What Started This?

For a while I have known about automatic differentiation via dual numbers, which is quite slick, but most deep learning frameworks don't use them. Instead, most deep learning frameworks build up a "computational graph" and use "reverse" automatic differentiation. I never thought about this too much until I came across Karpathy's micrograd. I was surprised at how little code there was. Briefly looking at the code, I realized the math it was doing seemed strange: yes, it was clearly doing derivative stuff, but it wasn't what I expected. So I read the Wikipedia article on automatic differentiation and made this demo to learn more.

The code implementing the reverse mode automatic differentiation is quite similar to micrograd (but in JavaScript rather than Python), and the graphical interface for the demo is inspired by TensorFlow's Playground. If you haven't seen that, you should check it out too; their UI is very nice.

Different Ways to Differentiate

Symbolic Differentiation

This is the way most people work out derivatives: via symbolic manipulation. For example, let \( f(x) = x^2 \); then we can symbolically calculate the derivative as \( \frac{d}{dx}f(x) = 2x \) (using the power rule, for example). This is what is taught in calculus class and what programs like Mathematica (see WolframAlpha), Maple, and other computer algebra systems do.

Numerically

Another way of computing derivatives is numerically. This is often used for engineering or science problems. Here the exact derivative isn't found; rather, a numerical approximation is computed. This can be done with the "forward difference" formula: \[ \frac{d}{dx}f(x) \approx \frac{f(x + \Delta x) - f(x)}{\Delta x} \] for some small \( \Delta x \), e.g. \( \Delta x = 0.00001 \). For more accurate results, use a smaller \( \Delta x \). This is what programs like MATLAB and others do.
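
To make the formula concrete, here's a tiny JavaScript sketch (this isn't code from the demo; the function name forwardDiff and the default step size are just choices for this illustration):

// Forward-difference approximation of df/dx at x.
function forwardDiff(f, x, dx = 1e-5) {
  return (f(x + dx) - f(x)) / dx;
}

// Example: f(x) = x^2, so the derivative at x = 3 should be about 6.
console.log(forwardDiff(x => x * x, 3)); // ≈ 6.00001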

Automatic Differentiation

Automatic differentiation is different from both of these approaches. It exploits the fact that whatever function you compute on a computer is made up of a bunch of simpler functions, like addition and multiplication. By working with these simpler functions, one can automatically compute exact derivatives. This can be done in a "forward" way, with dual numbers, or in a "reverse" way, by clever application of the chain rule. Dual numbers are like the complex numbers except instead of \( i^2 = -1 \) you have \( \varepsilon^2 = 0 \). The clever use of the chain rule goes like this: suppose you have some loss function \( L(x, y) = f(g(x, y), h(x, y)) \) and you want to find \( \frac{\partial L}{\partial x} \) and \( \frac{\partial L}{\partial y} \). The normal way you might work this out is: \[ \frac{\partial L}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} + \frac{\partial f}{\partial h} \frac{\partial h}{\partial x} \] \[ \frac{\partial L}{\partial y} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial y} + \frac{\partial f}{\partial h} \frac{\partial h}{\partial y} \] There are duplicate terms, \(\frac{\partial f}{\partial g} \) and \(\frac{\partial f}{\partial h} \), that you'd probably end up computing twice. Instead, a "computational graph" is formed so that each of these shared terms is computed only once.

All of this will probably make more sense with a fully worked out example.

A Fully Worked Example

Let's see this in action by computing the gradient of \(L(x, y) = (x+y)(x+2y)\) at the point \( (x, y) = (2, 3) \).

Symbolically

As humans, doing this symbolically, we'd expand this out as \(x^2 + 3xy + 2y^2 \), then take the partials: \[ \frac{\partial}{\partial x}(x^2 + 3xy + 2y^2) = 2x + 3y \] and \[ \frac{\partial}{\partial y}(x^2 + 3xy + 2y^2) = 3x + 4y \] then plug in \((2,3)\) to get \( \boxed{(13, 18)} \).

Numerically

Now let's compute it numerically. We'll use \( \Delta x = 0.1 \) and \( \Delta y = 0.1 \), then do it again for \( \Delta x = 0.001 \) and \( \Delta y = 0.001 \) to show how our numerical answer gets better.

The \( \Delta = 0.1 \) case: \begin{align} \frac{\partial L}{\partial x}(2, 3) & \approx \frac{L(2.1, 3) - L(2, 3)}{0.1} \\ & = \frac{(2.1 + 3)(2.1 + 6) - (2 + 3)(2 + 6)}{0.1} \\ & = \frac{41.31 - 40}{0.1} \\ & = \boxed{13.1} \end{align} \begin{align} \frac{\partial L}{\partial y}(2, 3) & \approx \frac{L(2, 3.1) - L(2, 3)}{0.1} \\ & = \frac{(2 + 3.1)(2 + 6.2) - (2 + 3)(2 + 6)}{0.1} \\ & = \frac{41.82 - 40}{0.1} \\ & = \boxed{18.2} \end{align} Fairly close. Now if we use \( 0.001 \) we should get even closer: \begin{align} \frac{\partial L}{\partial x}(2, 3) & \approx \frac{L(2.001, 3) - L(2, 3)}{0.001} \\ & = \frac{(2.001 + 3)(2.001 + 6) - (2 + 3)(2 + 6)}{0.001} \\ & = \frac{40.013001 - 40}{0.001} \\ & = \boxed{13.001} \end{align} \begin{align} \frac{\partial L}{\partial y}(2, 3) & \approx \frac{L(2, 3.001) - L(2, 3)}{0.001} \\ & = \frac{(2 + 3.001)(2 + 6.002) - (2 + 3)(2 + 6)}{0.001} \\ & = \frac{40.018002 - 40}{0.001} \\ & = \boxed{18.002} \end{align} As expected, with a smaller difference we get more accurate results.
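
The same numbers can be reproduced with a few lines of JavaScript. This is just an illustrative sketch; numericalGradient is a name made up for this example, not something from the demo:

// L(x, y) = (x + y)(x + 2y)
const L = (x, y) => (x + y) * (x + 2 * y);

// Forward-difference approximation of the gradient of L at (x, y).
function numericalGradient(L, x, y, d) {
  return [(L(x + d, y) - L(x, y)) / d,
          (L(x, y + d) - L(x, y)) / d];
}

console.log(numericalGradient(L, 2, 3, 0.1));   // ≈ [13.1, 18.2]
console.log(numericalGradient(L, 2, 3, 0.001)); // ≈ [13.001, 18.002]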

Forward Automatic Differentiation via dual numbers

Dual numbers behave like complex numbers, except instead of having \( i^2 = -1 \) we have \( \varepsilon^2 = 0 \). This might seem strange, but it has a cool effect: to compute our partial derivatives we just need to evaluate \( L(2 + \varepsilon, 3) \) and \( L(2, 3 + \varepsilon) \), then look at the "dual part" of each result: \begin{align} L(2 + \varepsilon, 3) & = (2 + \varepsilon + 3)(2 + \varepsilon + 6) & \\ & = (4 + 2 \cdot \varepsilon + 12) & \text{expand} \\ & + (\varepsilon \cdot 2 + \varepsilon^2 + \varepsilon \cdot 6) & \\ & + (6 + 3 \cdot \varepsilon + 18) & \\ & = (4 + 12 + 6 + 18) & \text{group terms} \\ & + (2 + 2 + 6 + 3) \cdot \varepsilon & \\ & + \varepsilon^2 & \\ & = 40 + \boxed{13} \cdot \varepsilon + 0 & \text{simplify} \\ \end{align} \begin{align} L(2, 3 + \varepsilon) & = (2 + 3 + \varepsilon)(2 + 2 \cdot (3 + \varepsilon)) & \\ & = (2 + 3 + \varepsilon)(2 + 6 + 2 \cdot \varepsilon) & \\ & = (4 + 12 + 4 \cdot \varepsilon) & \\ & + (6 + 18 + 6 \cdot \varepsilon) & \\ & + (\varepsilon \cdot 2 + \varepsilon \cdot 6 + 2 \cdot \varepsilon^2 ) & \text{expand} \\ & = (4 + 12 + 6 + 18) & \\ & + (4 + 6 + 2 + 6) \cdot \varepsilon & \\ & + 2 \cdot \varepsilon^2 & \text{group terms} \\ & = 40 + \boxed{18} \cdot \varepsilon + 0 & \text{simplify} \end{align} And we see, like mathemagic, dual numbers just spit out the answer, exactly.
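
To see how a computer could do this mechanically, here is a minimal sketch of dual-number arithmetic in JavaScript. This is not the demo's forward-mode implementation; the Dual class and its methods are hypothetical names used only for this example:

// A dual number a + b*eps, stored as (re, eps), where eps^2 = 0.
class Dual {
  constructor(re, eps) { this.re = re; this.eps = eps; }
  add(o) { return new Dual(this.re + o.re, this.eps + o.eps); }
  mul(o) {
    // (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, because eps^2 = 0.
    return new Dual(this.re * o.re, this.re * o.eps + this.eps * o.re);
  }
}

// L(x, y) = (x + y)(x + 2y), written with dual arithmetic.
const L = (x, y) => x.add(y).mul(x.add(y.mul(new Dual(2, 0))));

// dL/dx at (2, 3): give x a dual part of 1.
console.log(L(new Dual(2, 1), new Dual(3, 0)).eps); // 13
// dL/dy at (2, 3): give y the dual part instead.
console.log(L(new Dual(2, 0), new Dual(3, 1)).eps); // 18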

Reverse Automatic Differentiation via a computational graph

Here we need to break our function down into a composition of simpler functions, as a computer would calculate it: \[ L(x, y) = (x + y)(x + 2y) \] becomes \[ L(x, y) = f(g(x, y), h(x, k(y))) \] where \( f(a, b) = ab \), \( g(x, y) = x + y \), \( h(a, b) = a + b \), and \( k(y) = 2y \).
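
In code, this decomposition is just nested function calls (plain JavaScript, for illustration only):

const k = (y) => 2 * y;
const h = (a, b) => a + b;
const g = (x, y) => x + y;
const f = (a, b) => a * b;

const L = (x, y) => f(g(x, y), h(x, k(y)));
console.log(L(2, 3)); // 40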

Next, as it turns out, we'll need a lot of intermediate values for this to work. As such, the reverse mode process starts by computing the function in a forward manner, keeping track of the intermediate values, then does a backward pass. So let's do that now.

Our point is \( (x, y) = (2, 3) \) so plugging in: \[ k(y) = 6 \] \[ h(x, k(y)) = h(2, 6) = 8 \] \[ g(x, y) = g(2, 3) = 5 \] \[ f(g(x, y), h(x, k(y))) = f(5, 8) = 40 = L(x, y) \] Good, now we begin the reverse process: \[ \frac{\partial L}{\partial f} = 1 \] easy. Next we have \( f(g, h) = gh \), \( \frac{\partial f}{\partial g} = h \), \( \frac{\partial f}{\partial h} = g \), so: \[ \frac{\partial L}{\partial g} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial g} = 1 \cdot h = 8 \] \[ \frac{\partial L}{\partial h} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial h} = 1 \cdot g = 5 \] Keep going. We basically want to just keep taking derivatives until we can't anymore.

However, at this point we have a fork: \( x \) appears in both \( g \) and \( h \), and \( y \) appears in both \( g \) and \( k \). For this to work, we'll keep track of their contributions separately. We'll do this with some non-standard notation: let \( x_g \) and \( y_g \) denote the contributions to \( x \) and \( y \) that come through \( g \). Likewise, we'll subscript with \( h \) and \( k \) as needed. This is to keep things clear and to motivate the algorithm; we'll explain why it is mathematically sound at the end.

For now, we have \( g(x_g, y_g) = x_g + y_g \), \( \frac{\partial g}{\partial x_g} = 1 \), \( \frac{\partial g}{\partial y_g} = 1 \), so we have: \[ \frac{\partial L}{\partial x_g} = \frac{\partial L}{\partial g} \frac{\partial g}{\partial x_g} = 8 \cdot 1 = 8 \] \[ \frac{\partial L}{\partial y_g} = \frac{\partial L}{\partial g} \frac{\partial g}{\partial y_g} = 8 \cdot 1 = 8 \] Next, \( h(x_h, k) = x_h + k \), \( \frac{\partial h}{\partial x_h} = 1 \), \( \frac{\partial h}{\partial k} = 1 \) : \[ \frac{\partial L}{\partial x_h} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial x_h} = 5 \cdot 1 = 5 \] \[ \frac{\partial L}{\partial k} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial k} = 5 \cdot 1 = 5 \] Then, \( k(y_k) = 2y_k \), \( \frac{\partial k}{\partial y_k} = 2 \) : \[ \frac{\partial L}{\partial y_k} = \frac{\partial L}{\partial k} \frac{\partial k}{\partial y_k} = 5 \cdot 2 = 10 \] Finishing it up: \[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial x_g} + \frac{\partial L}{\partial x_h} = 8 + 5 = \boxed{13} \] \[ \frac{\partial L}{\partial y} = \frac{\partial L}{\partial y_g} + \frac{\partial L}{\partial y_k} = 8 + 10 = \boxed{18} \] Excellent! We got the correct gradient!
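
This bookkeeping is exactly what a micrograd-style implementation automates: each intermediate value becomes a node that remembers how to pass gradients back to its inputs, and the separate contributions are accumulated with +=. Below is a minimal sketch in JavaScript; the class and method names here are my own, not necessarily those used in the demo or in micrograd:

// Each node stores its value, its gradient, its input nodes, and a
// closure that pushes gradients back to those inputs (the chain rule).
class Value {
  constructor(data, inputs = []) {
    this.data = data;
    this.grad = 0;
    this.inputs = inputs;
    this.backwardStep = () => {};
  }
  add(o) {
    const out = new Value(this.data + o.data, [this, o]);
    out.backwardStep = () => { this.grad += out.grad; o.grad += out.grad; };
    return out;
  }
  mul(o) {
    const out = new Value(this.data * o.data, [this, o]);
    out.backwardStep = () => {
      this.grad += o.data * out.grad;
      o.grad += this.data * out.grad;
    };
    return out;
  }
  backward() {
    // Topologically sort the graph, then apply the chain rule in reverse.
    const order = [];
    const seen = new Set();
    const visit = (v) => {
      if (seen.has(v)) return;
      seen.add(v);
      v.inputs.forEach(visit);
      order.push(v);
    };
    visit(this);
    this.grad = 1;
    order.reverse().forEach((v) => v.backwardStep());
  }
}

const x = new Value(2), y = new Value(3);
const loss = x.add(y).mul(x.add(y.mul(new Value(2)))); // (x + y)(x + 2y)
loss.backward();
console.log(loss.data, x.grad, y.grad); // 40 13 18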

Justification for contribution notation

Hopefully all of that was clear, with the possible exception of the contribution notation. Let's justify that. Suppose we have \( f(g(x), h(x)) \) and we want \( \frac{d f}{d x} \). From the chain rule we have: \[ \frac{d f}{d x} = \frac{df}{dg} \frac{dg}{dx} + \frac{df}{dh}\frac{dh}{dx} \] So we can calculate \( \frac{df}{dg} \frac{dg}{dx} \) and \( \frac{df}{dh}\frac{dh}{dx} \) separately, and while we're doing that we can rename the \( x \) in \( g(x) \) to \( x_g \) and compute \( \frac{dg}{dx_g} \). Likewise, we can rename the \( x \) in \( h(x) \) to \( x_h \) and compute \( \frac{dh}{dx_h} \). Thus we have: \[ \frac{d f}{d x} = \frac{df}{dg} \frac{dg}{dx_g} + \frac{df}{dh}\frac{dh}{dx_h} \] And this is why we added the "contributions" in the last step to get the gradient above.

Reverse Mode is Usually Faster

For this example, it might seem like reverse automatic differentiation was a lot more work than any of the other methods. This is true only because we had just 2 parameters, \( x \) and \( y \). If there were more parameters, we'd start to see how much faster it is. For example, suppose we had some \( \mathscr{L} \) that has 100 parameters. Symbolically we'd need to manipulate \( \mathscr{L} \) 100 times, numerically we'd need to compute it 100 times, and with dual numbers we'd need to evaluate it 100 times. With the reverse mode we'd calculate it once, then backpropagate the derivative contributions once through the whole graph. Much less computation! This is why reverse mode automatic differentiation is essential to training neural networks.