## The conceptual framework behind gradient descent, backpropagation, and Software 2.0.

Differentiable code makes the Machine Learning world go round. PyTorch, TensorFlow, and every other modern Machine Learning framework rely on it when it comes to the actual learning. Reason enough for anybody even remotely interested in AI to be acquainted with differentiable code. We invite you to follow us through a two-part explanation of the conceptual framework that powers a large part of today’s AI industry. No Ph.D. required!

Differentiable code is vital to Machine Learning because it is the basis of gradient descent, a family of optimization algorithms doing the grunt work in the training of Machine Learning models. The adjective *differentiable* comes from the noun *differentiation*, the mathematical process of computing derivatives. Yes, the good old derivatives from your high-school calculus class. Indeed, you get amazingly far with high-school math when it comes to AI. For example, as we will see below, the *gradient* in the name *gradient descent* is essentially a collection of derivatives.

To train even a moderately sized deep neural network (DNN) using gradient descent, you have to compute *a lot* of derivatives. That’s why training efficiency depends on the ability to differentiate efficiently. Enter *automatic differentiation* (AD). AD is how computers calculate derivatives. AD, and in particular its so-called reverse mode (aka backpropagation), is the algorithmic driver behind the current AI gold rush. Without it, serious Machine Learning applications would not be feasible, let alone profitable, even on modern-day cloud infrastructure. In fact, if you see a deep neural network running *anywhere*, it has been trained using reverse-mode AD.

Today’s post will answer the following questions:

- What is Software 2.0 and why might the idea of differentiable code be even bigger than Machine Learning?
- How does my high-school calculus class relate to gradient descent, the work-horse of Machine Learning?
- What does it mean to differentiate code? How can one possibly compute the derivative of a piece of, say, Python, Scala, or C code?
- Why can and should differentiation be automated?

## Software 2.0, or Fill-in-the-Blanks Programming

With Machine Learning becoming the norm in the late 2010s, AI celebrities such as Yann LeCun or Andrej Karpathy popularized the idea of Software 2.0: If neural networks, a special kind of computer program, can learn their optimal parameters from sufficient training data, then why don’t we design *all* programs to optimize themselves based on data describing their correct behavior?

To illustrate this idea, imagine the space of all possible programs – the vast program space, where every possible computer program is represented by a single point. Software 2.0 is when you do not fix your entire program but rather select a subset of program space – a parameterizable class of possible programs – and leave it to the computer to find the optimal choice within this class. This is what Andrej Karpathy calls “fill-in-the-blanks programming”. You define the frame, Software 2.0 paints the picture.

*Differentiable programs automatically optimize within a subset of program space. (Inspired by Software 2.0, A. Karpathy, 2017)*

*The program from the tester’s viewpoint.*

But what makes a program *optimal*? Here, it turns out, computers are not that different from human software developers after all. Rather than a vague verbal explanation of software requirements, most developers prefer objectively testable specifications, ready to be integrated into an automated test suite. So when we provide a large set of tests to our Software 2.0, the computer systematically tries out various concrete programs, tracing out its subset of program space. For each of these programs it measures how far it is away from an optimal program by how well it satisfies the test suite.

This is where *soft unit tests* (SUTs) come in. A SUT is basically your regular unit test, with the only difference that it does not give you a binary response (green light on success, red light on failure) but instead calculates a score to quantify how far you are away from the expectation. Testing addition, for example, a SUT would tell you that 2+2=4.1 is less wrong than 2+2=-11.

Viewed from the outside, the program is a black box: you feed it an input *x*, and it produces an output *f*(*x*). Of course, if you cut the black box open, you reveal a (possibly huge) number of internal parameters. In a conventional program, all these parameters would be fixed to definite values. In Software 2.0, however, by way of tuning these parameters, the program “moves” through program space.

Each soft unit test consists of an input *x* and an expected output *y*. The input is inserted into the program, which produces from it the output *f*(*x*). This output is compared to the expected outcome *y* and receives a good score when both values are approximately equal. The result of the comparison is fed back into the Software 2.0 algorithm, where it is used to tune the parameters to incrementally improve the test score.

*Cutting the black box open reveals its internal parameters.*

*A Software 2.0 program subject to N soft unit tests tunes its internal parameters to match output f(xᵢ) and expectation yᵢ.*

After a relatively short-lived hype, the term Software 2.0 didn’t become quite the buzzword its inventors had hoped for. However, if you have been anywhere near the Machine Learning crowd in the last decade, you likely have heard of the most relevant subclass of Software 2.0: *differentiable code*, sometimes written ∂Code (“∂” is the symbol of partial differentiation).

## A first differentiable program

*(The idea of this section has been adapted from Differentiable Functional Programming, N. Welsh, 2018)*

Imagine you had to write a fairly simple program. The requirements are as follows: the program has to accept a single floating-point number as an input parameter *x*. From this input, it has to produce an output, another float. You are given 200 soft unit tests describing the required behavior. Each test is defined by a pair (*x*, *ŷ*) of one input *x* and one expected output *ŷ*.

Since the unit tests are soft, there has to be a measure of how well or badly the program did on each test. For this purpose we introduce the scoring metric

𝓛⁽ⁱ⁾ = ( *f*(*x*⁽ⁱ⁾) − *ŷ*⁽ⁱ⁾ )²

The metric 𝓛⁽ⁱ⁾ is the so-called *loss* of the *i*-th test. Optimally, it would be zero, because then the output *f*(*x*⁽ⁱ⁾) would exactly match the expected output *ŷ*⁽ⁱ⁾. The overall test suite performance metric 𝓛 of our program is then simply the sum of all 200 𝓛⁽ⁱ⁾, or, to put it as a mathematician’s for-loop,

𝓛 = Σᵢ₌₁²⁰⁰ 𝓛⁽ⁱ⁾

The smaller 𝓛, the better our overall performance. Since the tests are pairs of numbers, we can visualize all 200 in one 2-D plot, with the inputs on the *x*-axis and the outputs on the *y*-axis. Let’s assume the plot of your test data looks like the picture below on the left.

*A plot of 200 soft unit tests. Each point represents a pair of input and expected output.*

The points look remarkably like a sine function, so a natural choice is to implement the program as *f*(*x*) = sin(*ω*·*x*), leaving open a single parameter *ω* (that’s a lower-case “omega”), denoting the frequency of the sine. Therefore, we are looking at a one-dimensional subset of program space that you can visualize as a line from *ω* = -∞ to *ω* = ∞, just like the *x*-axis goes from *x* = -∞ to *x* = ∞.

Every *ω*-value corresponds to another version of our program *f*(*x*). Let’s say we set *ω* to 12.345; then our program becomes *f*(*x*) = sin(12.345·*x*). So, let’s give it a try! Let’s choose some *ω*-values and see if we can’t satisfy all those unit tests.

*Test suite performance for parameter values ω=0.75, 2.25, 1.0, and 1.5. Left: the program’s outputs (solid line) vs. the expected outputs (dots). In one frame, three orange bars on the left indicate the point-wise error 𝓛⁽ⁱ⁾ for three example points. Right: 𝓛-score for each ω-value.*

*ω*=1.5 is by far the best result. But could there be a better solution, at 1.54321, say? In this example, there is not. In reality, though, by random choice it might have taken you dozens or hundreds of attempts to find a sufficiently good solution, and you’d probably still not have arrived very close to the correct 1.5.
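This trial-and-error search is easy to mimic in code. Here is a minimal sketch, with the 200 soft unit tests replaced by synthetic stand-in data generated from sin(1.5·*x*) (an assumption for illustration; the post’s actual test data is not reproduced here):

```python
import math

# Synthetic stand-in for the 200 soft unit tests (x_i, y_i),
# generated here from the "true" program sin(1.5 * x).
tests = [(x / 10.0, math.sin(1.5 * x / 10.0)) for x in range(-100, 100)]

def loss(omega):
    """Overall test-suite score L(omega): sum of point-wise losses."""
    return sum((math.sin(omega * x) - y_hat) ** 2 for x, y_hat in tests)

# Try a few candidate frequencies, just like in the figure above.
scores = {omega: loss(omega) for omega in (0.75, 2.25, 1.0, 1.5)}
best = min(scores, key=scores.get)
```

Running this confirms the picture: the candidate 1.5 wins by a wide margin, while the other frequencies accumulate large losses over the 200 tests.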

But what exactly are the *unknowns* in this function? Despite the 200 *x*–*y*-pairs, 𝓛 is structurally very simple. Admittedly, for a human it would be a bit unwieldy to write out the full sum of 200 terms, one for each SUT. For a computer, though, a sum of 200 terms is no scarier than a sum of two. What remains is a simple one-parameter function, 𝓛(*ω*). If you look at its definition above, you see that it is completely determined by the test suite’s *x*⁽ⁱ⁾- and *ŷ*⁽ⁱ⁾-values. This means there is no need for randomly sampling parameter values at all. You already know the function’s value at *every single point!* You can easily plot 𝓛(*ω*), just like any other function:

*The program’s overall test suite performance (loss score) as a function of the frequency parameter ω.*

To find the optimal *ω*, you simply have to find a minimum of the function 𝓛(*ω*). Finding minima – wait a minute, didn’t we do that all the time in our high-school calculus class? Indeed. And it had something to do with derivatives. Fortunately, 𝓛(*ω*) is not only structurally simple but also *differentiable*. The term *differentiable* means that you can calculate a derivative of 𝓛(*ω*). In fact, the whole reason why our little example here is called *differentiable code* is that 𝓛(*ω*), the function towards which we want to optimize, is differentiable. Let us refresh our calculus memories to see how the derivative can help us to systematically optimize.

## Calculus recap

The derivative of a function *f*(*x*) at point *x* measures the local slope of the function. You can view the derivative as a tool that always tells you where’s up and where’s down on a function. Truth is, that’s all you need to know to understand the essential idea of gradient descent and differentiable code.

*The derivative is the slope of a line connecting two points in the limit of vanishing distance between said points.*

To see how the derivative relates to the slope, let us take a look at its definition. Commonly, the derivative is written *f* ′ (read: “eff prime”) and defined as a mathematical limit of the form

*f* ′(*x*) = lim_{Δ*x*→0} [ *f*(*x*+Δ*x*) − *f*(*x*) ] / Δ*x*

This expression has a beautiful geometrical interpretation (see the animation above on the right):

*The slope is change in height divided by change in position, that is, Δf/Δx.*

Imagine a line connecting the two points (*x*, *f*(*x*)) and (*x*+Δ*x*, *f*(*x*+Δ*x*)), and watch what happens as you reduce the horizontal distance Δ*x*, bringing the points closer and closer together. In the mathematical limit Δ*x*→0 the points practically merge. The slope is then quite naturally defined by Δ*f*/Δ*x*, the change in height divided by the change in position.

Just as it is natural to describe a slope by saying how much height changes per distance, it is natural to describe the slope of a function by saying how much its value changes per parameter change.

The sign of the derivative tells you in which direction the function value increases. *Direction* here literally means along which direction you move on the *x*-axis, that is, whether you increase or decrease the parameter *x*.

*In minima and maxima the slope is zero (horizontal), that is, f′=0.*

What does that mean now, exactly? Consider you are currently at parameter value *x₀*. There, the local derivative is *f* ′(*x₀*). If the derivative is larger than zero, it means that by increasing *x* you go up and by decreasing *x* you go down. So, what do you do to come closer to your objective, the highest point? You increase *x*. More mathematically speaking, you choose a new parameter value *x₁* by taking a step in the direction of positive slope:

*x₁* = *x₀* + *ɛ*·*f* ′(*x₀*)

where *ɛ* quantifies your step size. In Machine Learning lingo, the step size is called the learning rate.

This formula automatically covers the opposite case, where the derivative is less than zero. Negative *f* ′ means that for larger *x*, you go down, farther away from your objective, the highest point. Therefore, at negative *f* ′(*x₀*), going upwards means taking a step towards a *smaller* *x*-value.

If you want to minimize the function value, you proceed similarly, only that now you take *ɛ*-steps in the direction of negative slope (downhill):

*x₁* = *x₀* − *ɛ*·*f* ′(*x₀*)

The process is visualized below.

*The derivative is used to iteratively find a local maximum (left) or minimum (right).*
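In code, the whole minimization procedure is one update line inside a loop. A minimal sketch, using the hypothetical loss function f(x) = (x − 2)², whose minimum we know sits at x = 2:

```python
def f_prime(x):
    # Derivative of the example loss f(x) = (x - 2)**2
    return 2 * (x - 2)

x = 0.0      # starting point x_0
eps = 0.1    # step size (learning rate)
for _ in range(100):
    x = x - eps * f_prime(x)   # step in the direction of negative slope
# x has now converged very close to the minimum at 2.0
```

Note how the loop needs nothing but the derivative: no plot, no sampling, just "which way is downhill" at the current point.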

Note that this procedure only finds *local* optima. If you start closer to *x*ₘᵢₙ,₂ than to *x*ₘᵢₙ,₁, minimization using the described method will always bring you to *x*ₘᵢₙ,₂ and never to *x*ₘᵢₙ,₁, although the latter would be an even lower, that is, *better*, point.

There are also other special points, such as saddle points, sharing the *f*′=0 property, that will interfere with your quest for an optimum.

Another subtlety is the choice of the step size *ɛ*. If it is too large, you risk overshooting, that is, you might jump over an optimum instead of approaching it. If, on the other hand, it is too small, you might have to take *a lot* of steps to come close enough to an optimum. Even worse, the step size is actually not *ɛ* but *ɛ*⋅*f* ′ and thus proportional to the derivative’s value. Since the derivative becomes zero at the extremum, it will become smaller and smaller as you approach it, which in turn results in you taking smaller and smaller steps. To combat both this slow-down and the overshooting problem, in practical applications *ɛ* is chosen adaptively.

## An excursion to higher dimensions

But isn’t reality much more complicated, when we have not one parameter but typically hundreds or thousands? No, not really. In fact, it is almost surprisingly easy to generalize the approach described in the previous section from one to arbitrarily many dimensions.

Let us start by generalizing from one-parameter functions *f*(*x*) to two-parameter functions *f*(*x*, *y*). Just like the plot of *f*(*x*) was two-dimensional (one dimension for input *x*, one for output *f*(*x*)), the plot of *f*(*x*, *y*) is three-dimensional (two inputs + one output). You can visualize it like a landscape and think of *f*(*x*, *y*) as a height-map.

*“Landscape” defined by a two-dimensional function f(x,y).*

To get a feeling for the landscape, we can project lines of constant *x*- and *y*-values (parallel to the *x*- and *y*-axes) onto the surface, as is shown in the plot on the left. Each of these lines corresponds to a vertical cut through the landscape and is itself a one-dimensional function. For example, the cut parallel to the *y*-axis at *x*=-0.5 can be seen as the function *g*(*y*) = *f*(-0.5, *y*) with constant *x*=-0.5. That means that at any point (*x*₀, *y*₀), we can slice the 2-D function into two 1-D functions. And we already know how to get the slope of a 1-D function! We simply calculate its derivative.

*Vertical cuts through the f(x,y)-landscape at point (-0.5, 0.5), parallel to the x-axis (left), and parallel to the y-axis (right).*

But first, we have to refine our notation a bit. Now that there is more than one variable, the notation *f* ′ has become ambiguous, as it is unclear whether the derivative is with respect to *x* or *y*. To disambiguate, there is a more general notation. A derivative with respect to *x* is denoted

∂*f*/∂*x*

and the derivative with respect to *y* is

∂*f*/∂*y*

The symbol ∂ goes by many names, such as “die”, “doh” or – my favorite – “dabba”, to name a few. But since it is a stylized letter d, you can simply read it “dee”, and the derivative with respect to *x* as “dee eff dee ex”. If you are not used to multivariate calculus, it might be confusing how suddenly a fraction is used to generalize a simple prime (′) symbol. This comes quite naturally, though, because, as we saw in the previous section, the derivative *is* a fraction in the limit of vanishing Δ*x*:

∂*f*/∂*x* = lim_{Δ*x*→0} Δ*f*/Δ*x*

You can interpret the numerator ∂*f* as the difference in *f* and the denominator ∂*x* as the difference in *x*, in the frame of the limit process.

Having settled on a notation, how do we combine the two 1-D derivatives back to a two-dimensional entity, so that we can tell in which direction to change our parameters (*x*,*y*)?

It is one of the great wonders of calculus that, as long as you make small enough steps, you do not even need to combine them at all. If you make a small step along the *y*-direction exactly as you would in the 1-D case, that is,

*y*₁ = *y*₀ − *ɛ*·∂*f*/∂*y*

and another small step in the *x*-direction, again, exactly as you would in 1-D, that is,

*x*₁ = *x*₀ − *ɛ*·∂*f*/∂*x*

you have effectively made a 2-D move in the correct direction, straight toward the minimum. Now what is *small enough* with regards to step size? Strictly speaking, two 1-D steps are the same as one 2-D step only if the steps are what mathematicians call *infinitesimal*, that is, *infinitely* small. However, we are pretty much on the safe side here, as the error from neglecting the interaction of the *x*- and *y*-dimensions with finite steps scales benignly with the step size. This means that when you increase the step size, you will feel the errors from the single-parameter derivatives long before the mixed-dimension errors become significant (because the former are of order 𝒪(*ɛ*) while the latter are of order 𝒪(*ɛ*²) and *ɛ* is small).

Of course, there is a better, more compact way to express that you are actually doing the same for each dimension. First, you can define a shorthand for *x* and *y*:

**x** = (*x*, *y*)

Second, the collection of all 1-D derivatives of a function, the *gradient*, is denoted by ∇*f*, which in 2-D is defined as

∇*f* = (∂*f*/∂*x*, ∂*f*/∂*y*)

So you can write your two 1-D optimization steps together as one 2-D step:

**x**₁ = **x**₀ − *ɛ*·∇*f*(**x**₀)

Now that we have seen how to perfectly decompose a 2-D problem into two 1-D problems, what is going to stop us from going to 3 dimensions, 42 dimensions, or any other conceivable number of dimensions? Nothing will!

Consider any positive integer dimension *N*. Using again a shorthand for the variables,

**x** = (*x*₁, *x*₂, …, *x*_N)

the slope in *N* dimensions can be written as the combination of *N* one-dimensional slopes:

∇*f* = (∂*f*/∂*x*₁, ∂*f*/∂*x*₂, …, ∂*f*/∂*x*_N)

Moreover, written this way, the optimization step in *N*-D looks exactly like in 2-D:

**x**₁ = **x**₀ − *ɛ*·∇*f*(**x**₀)
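The N-dimensional update really is the same one line, applied per coordinate. A minimal sketch for the hypothetical function f(x, y) = x² + y², whose gradient is ∇f = (2x, 2y) and whose minimum is at the origin:

```python
def grad_f(params):
    # Gradient of the example function f(x, y) = x**2 + y**2
    x, y = params
    return (2 * x, 2 * y)

params = (3.0, -4.0)   # starting point
eps = 0.1              # step size
for _ in range(200):
    g = grad_f(params)
    # One N-D step is just one 1-D step per coordinate
    params = tuple(p - eps * gi for p, gi in zip(params, g))
# params has now converged close to the minimum at (0, 0)
```

Nothing in the loop cares that there are two coordinates rather than one or a thousand; that is exactly the point of the gradient notation.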

At this point maybe you think that the mechanics of calculating high-dimensional derivatives is not that difficult after all, but you’re still not quite comfortable with actually *thinking* in *N*-dimensional spaces. If so, do not worry, you are in good company! Just do as Geoffrey Hinton, one of the doyens of AI, recommends:

*To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it.*

## Differentiating some code

Now we have been talking quite a bit about derivatives of mathematical functions in one or many dimensions. But what is the derivative of a piece of computer code? Let us have a look at some concrete programming languages and calculate some actual code derivatives. To follow, you should either be familiar with the derivatives of elementary functions such as x², sin(x), and exp(x) or have a reference at hand.

Consider, for example, this simple function implemented in Python:
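A minimal implementation matching the description below (the exact original listing may have differed slightly):

```python
def f(x):
    # Implements the mathematical function f(x) = 3x
    return 3 * x
```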

This function implements the mathematical function *f*(*x*) = 3*x*. In calculus class it would be the easiest task to calculate its derivative: *f* ′(*x*) = 3. So how would the best possible implementation of the derivative look in Python? How about this:
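A minimal version of the snippet the text refers to (again, the exact original listing may have differed slightly):

```python
def fpr(x):
    # Derivative of f(x) = 3x is the constant f'(x) = 3
    return 3
```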

(Here, **fpr** is short for “f prime”.) This can justly be called the best possible implementation. It calculates the value of the derivative to machine accuracy. One could even argue that this derivative is actually *exact* and that only its evaluation at a given *x* introduces an approximation to machine accuracy.

Let us look at another example, this time in Scala:
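A minimal Scala version matching the signature and behavior described below (the exact original listing may have differed slightly):

```scala
// Implements the mathematical function g(x) = sin(x)
def g(x: Double): Double = math.sin(x)
```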

Again, consider the mathematical function implemented here: *g*(*x*) = sin(*x*). The derivative of this important but elementary function is simply *g* ′(*x*) = cos(*x*). Therefore, the derivative of our Scala function **g(x: Double): Double** should be this:
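Again, a minimal reconstruction of the referenced snippet (the exact original listing may have differed slightly):

```scala
// Derivative of g(x) = sin(x) is g'(x) = cos(x)
def gpr(x: Double): Double = math.cos(x)
```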

Here is one final example, this time in C, to make sure we got this right:

We need a function **float hpr(float x)**. How would it look? Starting again from the actual mathematical function implemented in the code and applying basic differentiation rules, we obtain the derivative, which can then be readily implemented in C!

Easy enough, the derivative of code that implements a mathematical function is the implementation of said function’s derivative.

## Complicated functions are but chains of simple ones

So far, there is no need for automation. The derivatives in the previous section’s examples could simply be implemented by hand. But how about more complicated functions? Every function, no matter how complicated, is made up of simple operations. To illustrate how to go from simple to complicated, let’s start with a function *h*(*x*) that can be seen as the result of the subsequent application of two simpler functions, say *h*(*x*) = *v*(*u*(*x*)). Creating new functions by inserting the result of one function into another is known as *function composition*. There is a special notation that saves you from getting lost in all those brackets: one writes *h* = *v* ∘ *u* (read: “*v* after *u*”).

Let’s take our composition one step further and plug *h*(*x*) into yet another function *w*. The result would be *w*(*h*(*x*)) = (*w* ∘ *v* ∘ *u*)(*x*).

Of course, we could continue this game ad infinitum, composing together more and more functions.

Interestingly, the fact that a function can be composed of elementary functions allows us to also compose its derivative from the derivatives of these elementary functions. You probably remember this as another concept from high-school calculus, known as the *chain rule*:

( *f*(*g*(*x*)) )′ = *f* ′(*g*(*x*)) ⋅ *g* ′(*x*)

If *g*(*x*) is itself a composition, the chain rule can be applied recursively, to differentiate arbitrarily long composition chains.

Consequently, as long as we know how to differentiate a relatively small set of elementary functions such as sin(*x*), exp(*x*), log(*x*) and *xⁿ* and basic operations such as sums and products of functions, the chain rule provides us with a simple recipe to compute derivatives of any function, no matter how complicated.
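To see the recipe in action, here is the chain rule applied by hand to a hypothetical composition F(x) = exp(sin(x²)), checked against a numerical difference quotient:

```python
import math

def F(x):
    # F = exp ∘ sin ∘ square
    return math.exp(math.sin(x * x))

def F_prime(x):
    # Chain rule, applied recursively from the outside in:
    # F'(x) = exp(sin(x^2)) * cos(x^2) * 2x
    return math.exp(math.sin(x * x)) * math.cos(x * x) * 2 * x

# Numerical sanity check via a small difference quotient
dx = 1e-6
numeric = (F(1.2 + dx) - F(1.2)) / dx
```

The analytical and numerical values agree closely, which is exactly what makes the purely mechanical chain-rule recipe so trustworthy.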

You could now use this recipe to calculate the derivative of any function manually, for example on a piece of paper, and then implement the result in code. Mathematically, there’s nothing wrong with that approach. Practically, though, even if you only have a moderately complicated function, for example a product of some dozen factors, each composed of a handful of chained elementary functions, this process becomes tedious and error-prone.

For serious applications, such as the training of a DNN made up of thousands of elementary operations, the manual approach would be ridiculous. This raises the age-old question that always comes up when facing huge amounts of tedious work: can’t someone else do it?

Yes, the computer can! To have the computer differentiate any differentiable function, we just have to teach it how to differentiate simple operations and the chain rule – and we’re done!

Application of the chain rule is a purely mechanical process that we can easily turn into an algorithm and hand it off to the computer.

This is the idea behind *automatic* differentiation. It can be realized in various ways, for example using a preprocessor that produces the actual code of the differentiated functions, just like we did manually in our example above. The more common approach watches your program at runtime, records the sequence (or rather, graph) of elementary operations on a so-called *tape*, and then computes the derivative of the taped code dynamically.
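As a small taste of what such automation looks like, here is a toy sketch of forward-mode AD using “dual numbers”: every value carries its derivative along, and each elementary operation applies the chain rule mechanically. (This is an illustration of the principle only, not how PyTorch or TensorFlow implement their tapes internally.)

```python
import math

class Dual:
    """A value together with its derivative; every operation applies the chain rule."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot

    def __add__(self, other):
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(d):
    # Chain rule for sine: (sin u)' = cos(u) * u'
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

# Differentiate f(x) = x * sin(x) at x = 2 without ever writing f' by hand:
x = Dual(2.0, 1.0)   # seed: dx/dx = 1
y = x * sin(x)
# y.val holds f(2); y.dot holds f'(2) = sin(2) + 2*cos(2)
```

The user code (`x * sin(x)`) looks like an ordinary computation; the derivative bookkeeping happens entirely inside the overloaded operations.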

## Conclusion

In the upcoming second part, we will cover:

- How automatic differentiation (AD) works
- What forward-mode and reverse-mode (backpropagation) AD are
- Why gradient descent *needs* reverse-mode AD
- How to compute derivatives of for-loops, conditionals, and other control structures
- Applications of differentiable code beyond classical Machine Learning

In the meantime, you might want to check out the following excellent presentations for further insights into the world of differentiable code:

- Ryan P. Adams, You Should Be Using Automatic Differentiation, 2016
- Matthew J. Johnson, Automatic Differentiation, Deep Learning Summer School Montreal 2017
- Noel Welsh, Differentiable Functional Programming, Scala Days 2018
- Viral B. Shah, Differentiable Programming — An exciting generalization of Deep neural networks, PyData Delhi 2019