High-School Math for Machines: Differentiable Code

The conceptual framework behind gradient descent, backpropagation, and Software 2.0.

Dr. Michael Köpf
May 11, 2020

Differentiable code makes the Machine Learning world go round. PyTorch, TensorFlow, and every other modern Machine Learning framework rely on it when it comes to the actual learning. Reason enough for anybody even remotely interested in AI to be acquainted with differentiable code. We invite you to follow us through a two-part explanation of the conceptual framework that powers a large part of today’s AI industry. No Ph.D. required!

Differentiable code is vital to Machine Learning because it is the basis of gradient descent, a family of optimization algorithms doing the grunt work in the training of Machine Learning models. The adjective differentiable comes from the noun differentiation, the mathematical process of computing derivatives. Yes, the good old derivatives from your high-school calculus class. Indeed, you get amazingly far with high-school math, when it comes to AI. For example, as we will see below, the gradient in the name gradient descent is essentially a collection of derivatives.

To train even a moderately sized deep neural network (DNN) using gradient descent, you have to compute a lot of derivatives. That’s why training efficiency depends on the ability to differentiate efficiently. Enter automatic differentiation (AD). AD is how computers calculate derivatives. AD, and in particular its so-called reverse mode (aka backpropagation), is the algorithmic driver behind the current AI gold rush. Without it, serious Machine Learning applications would not be feasible, let alone profitable, even on modern-day cloud infrastructure. In fact, if you see a deep neural network running anywhere, it has been trained using reverse-mode AD.

Today’s post will answer the following questions:

• What is Software 2.0 and why might the idea of differentiable code be even bigger than Machine Learning?
• How does my high-school calculus class relate to gradient descent, the workhorse of Machine Learning?
• What does it mean to differentiate code? How can one possibly compute a derivative of a piece of, say, Python, Scala, or C code?
• Why can and should differentiation be automated?

Software 2.0, or Fill-in-the-Blanks Programming

With Machine Learning becoming the norm in the late 2010s, AI celebrities such as Yann LeCun or Andrej Karpathy popularized the idea of Software 2.0: If neural networks, a special kind of computer program, can learn their optimal parameters based on sufficient training data, then why don’t we design all programs to optimize themselves based on data describing their correct behavior?

To illustrate this idea, imagine the space of all possible programs – the vast program space, where every possible computer program is represented by a single point. Software 2.0 means you do not fix your entire program but rather select a subset of program space – a parameterizable class of possible programs – and leave it to the computer to find the optimal choice within this class. This is what Andrej Karpathy calls "fill-in-the-blanks programming". You define the frame, Software 2.0 paints the picture.

Differentiable programs automatically optimize within a subset of program space. (Inspired by Software 2.0, A. Karpathy, 2017)

The program from the tester’s viewpoint.

But how do you explain to a computer what exactly you consider optimal? Here, it turns out, computers are not that different from human software developers, after all. Rather than a vague verbal explanation of software requirements, most developers prefer objectively testable specifications, ready to be integrated into an automated test suite. So when we provide a large set of tests to our Software 2.0, the computer systematically tries out various concrete programs, tracing out its subset of program space. For each of these programs, it measures how far it is from an optimal program by how well it satisfies the test suite.
To make this approach work, we have to help the computer a little bit by using a special kind of test, so-called soft unit tests (SUT). A SUT is basically your regular unit test with the only difference that it does not give you a binary response (green light on success, red light on failure) but instead calculates a score to quantify how far you are away from the expectation. Testing addition, for example, a SUT would tell you that 2+2=4.1 is less wrong than 2+2=-11.
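To make the idea of a soft unit test concrete, here is a minimal sketch in Python. The function name and the squared-error score are illustrative assumptions, not from the original; any score that shrinks as the result approaches the expectation would do:

```python
def soft_test_add(actual, expected):
    """Soft unit test: instead of a binary pass/fail, return a score
    quantifying how far the actual result is from the expectation.
    A perfect pass scores exactly 0."""
    return (actual - expected) ** 2

# 2 + 2 = 4.1 is much "less wrong" than 2 + 2 = -11:
score_close = soft_test_add(4.1, 4)  # small score, almost correct
score_far = soft_test_add(-11, 4)    # large score, badly wrong
assert score_close < score_far
```

The graded score is what later lets an optimizer tell whether a parameter change made things better or worse, which a binary red/green light cannot.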
From the tester’s viewpoint, the program is a black box converting inputs into outputs. Given some input x, it produces an output f(x). Of course, if you cut the black box open, you reveal a (possibly huge) number of internal parameters. In a conventional program, all these parameters would be fixed to definite values. In Software 2.0, however, by way of tuning these parameters, the program "moves" through program space.
Each of the SUTs comprises an input x and an expected output ŷ. The input is fed into the program, which produces from it the output f(x). This output is compared to the expected outcome ŷ and receives a good score when both values are approximately equal. The result of the comparison is fed back into the Software 2.0 algorithm, where it is used to tune the parameters to incrementally improve the test score.

Cutting the black box open reveals its internal parameters. A Software 2.0 program subject to N soft unit tests tunes its internal parameters to match the outputs f(x⁽ⁱ⁾) and expectations ŷ⁽ⁱ⁾.

After a relatively short-lived hype, the term Software 2.0 didn’t quite become the buzzword its inventors had hoped for. However, if you have been anywhere near the Machine Learning crowd in the last decade, you have likely heard of the most relevant subclass of Software 2.0: differentiable code, sometimes written ∂Code ("∂" is the symbol of partial differentiation).

Whereas Software 2.0 describes self-optimizing code in general, differentiable code is code that self-optimizes through gradient descent methods. How does differentiable code help us with the optimization procedure? Naively, we could try out various parameter sets at random, until we are reasonably sure that we can live with the remaining error. But there are at least two major problems with that. First, the full parameterized subset of program space is usually enormous. Even if there’s only a single continuous parameter, we have theoretically infinitely many possible choices. Second, whenever we find a seemingly good solution, how can we ever be sure if there is not an even better solution right next to it?
These are all age-old problems, by no means specific to Machine Learning. In mathematics, such optimization problems are all over the place. Standing on the shoulders of giants such as Newton and Leibniz, we can understand most of this surprisingly well with what is, today, high-school math knowledge.
To illustrate these problems and their solution, let’s have a first look at differentiable code.

A first differentiable program

(The idea of this section has been adapted from Differentiable Functional Programming, N. Welsh, 2018)

Imagine you have to write a fairly simple program. The requirements are as follows: The program has to accept a single floating-point number as an input parameter x. From this input, it has to produce an output y, another float. You are given 200 soft unit tests describing the required behavior. Each test is defined by a pair (x, ŷ) of one input x and one expected output ŷ.

Since the unit tests are soft, there has to be a measure of how well or how badly the program did on each test. For this purpose, we introduce the scoring metric

𝓛⁽ⁱ⁾ = (f(x⁽ⁱ⁾) − ŷ⁽ⁱ⁾)²

The metric 𝓛⁽ⁱ⁾ is the so-called loss of the i-th test. Optimally, it would be zero, because then output f(x⁽ⁱ⁾) would exactly match the expected output ŷ⁽ⁱ⁾. The overall test suite performance metric 𝓛 of our program is then simply the sum of all the 200 𝓛⁽ⁱ⁾, or, to put it as a mathematician’s for-loop,

𝓛 = Σᵢ 𝓛⁽ⁱ⁾, with i running from 1 to 200

The smaller 𝓛, the better our overall performance. Since the tests are pairs of numbers, we can visualize all 200 in one 2-D plot, with the inputs on the x-axis and the outputs on the y-axis. Let’s assume the plot of your test data looks like the picture below on the left.

A plot of 200 soft unit tests. Each point represents a pair of input and expected output.
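To make the loss definitions concrete, here is a minimal sketch of the computation in Python. The squared-error form of the per-test loss and all names are illustrative assumptions:

```python
import math

def total_loss(program, tests):
    """Overall loss L: the sum of the per-test losses
    L_i = (f(x_i) - y_i)^2 over the whole test suite."""
    return sum((program(x) - y_expected) ** 2 for x, y_expected in tests)

# Illustrative test suite: 200 (input, expected output) pairs.
xs = [i / 10 for i in range(200)]
tests = [(x, math.sin(1.5 * x)) for x in xs]

# A program that reproduces every expectation scores a perfect 0:
perfect = lambda x: math.sin(1.5 * x)
assert total_loss(perfect, tests) == 0.0
```

For a computer, summing 200 terms is just as easy as summing two, which is exactly the point made below.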

Now, there is one last major decision to make, before you can hand-off your development task to the algorithm: Choosing a suitable continuous subset of program space to search for the optimum.
Considering the sinusoidal nature of the input-output relation in your test suite, we have a hunch that

f(x) = sin(ω·x)

might be a reasonably good choice. There is only one internal parameter in that function: ω (that’s a lower-case "omega"), denoting the frequency of the sine. Therefore, we are looking at a one-dimensional subset of program space that you can visualize as a line from ω=-∞ to ω=∞, just like the x-axis goes from x=-∞ to x=∞.
Each possible ω-value corresponds to another version of our program f(x). Let’s say we set ω to 12.345; then our program becomes

f(x) = sin(12.345·x)

So, let’s give it a try! Let’s choose some ω-values and see if we can’t satisfy all those unit tests.

Test suite performance for parameter values ω=0.75, 2.25, 1.0, and 1.5. Left: the program’s outputs (solid line) vs. the expected outputs (dots). In one frame, three orange bars on the left indicate the point-wise error 𝓛⁽ⁱ⁾ for three example points. Right: 𝓛-score for each ω-value.

Looking at the animation above, ω=1.5 is by far the best result. But could there be a better solution, at 1.54321, say? In this example, there is not. In reality, though, by random choice it might have taken you dozens or hundreds of attempts to find a sufficiently good solution, and you’d probably still not have arrived very close to the correct 1.5.
How can we get out of this dilemma? Think about the loss metric 𝓛. What are the parameters, that is, the unknowns, in this function? Despite the 200 xy-pairs, 𝓛 is structurally very simple. Admittedly, for a human it would be a bit unwieldy to write out the full sum of 200 terms, one for each SUT. For a computer, though, a sum of 200 terms is no scarier than a sum of two. What remains is a simple one-parameter function, 𝓛(ω). If you look at its definition above, you see that it is completely determined by the test suite’s x⁽ⁱ⁾- and ŷ⁽ⁱ⁾-values. This means there is no need for randomly sampling parameter values at all. You already know the function’s value at every single point! You can easily plot 𝓛(ω), just like any other function:

The program’s overall test suite performance (loss score) as a function of the frequency parameter ω.
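In code, "plotting 𝓛(ω)" amounts to nothing more than evaluating the loss on a grid of ω-values; no randomness is involved. Here is a minimal sketch, where the squared-error loss and the synthetic test suite are my assumptions for illustration:

```python
import math

def loss(omega, tests):
    """L(omega): sum of squared errors of f(x) = sin(omega * x)
    over the whole test suite."""
    return sum((math.sin(omega * x) - y) ** 2 for x, y in tests)

# Hypothetical test suite generated from the "true" frequency 1.5:
tests = [(i / 10, math.sin(1.5 * (i / 10))) for i in range(200)]

# Evaluate L(omega) on a grid of candidate frequencies:
candidates = [0.75, 1.0, 1.5, 2.25]
scores = {w: loss(w, tests) for w in candidates}
best = min(scores, key=scores.get)
assert best == 1.5
```

Replacing the four candidates with a fine grid of thousands of ω-values would produce exactly the kind of loss curve shown in the plot.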

So, to find the optimal ω, you simply have to find a minimum of the function 𝓛(ω). Finding minima – wait a minute, didn’t we do that all the time in our high-school calculus class? Indeed. And it had something to do with derivatives. Fortunately, 𝓛(ω) is not only structurally simple but also differentiable. The term differentiable means that you can calculate a derivative of 𝓛(ω). In fact, the whole reason why our little example here is called differentiable code is that 𝓛(ω), the function towards which we want to optimize, is differentiable. Let us refresh our calculus memories to see how the derivative can help us to systematically optimize.

Calculus recap

A derivative of a function f(x) at point x measures the local slope of the function. You can view the derivative as a tool that always tells you where’s up and where’s down on a function. Truth is, that’s all you need to know to understand the essential idea of gradient descent and differentiable code.
More formally put: if the derivative with respect to its parameter is greater than 0, it means that you will go uphill by increasing the parameter. If it is smaller than 0, it means that you will go downhill by increasing the parameter.

The derivative is the slope of a line connecting two points in the limit of vanishing distance between said points.

To see how the derivative relates to the slope, let us take a look at its definition. Commonly, the derivative is written f′ (read: "eff prime") and defined as a mathematical limit of the form

f′(x) = lim(Δx→0) [f(x+Δx) − f(x)] / Δx

This expression has a beautiful geometrical interpretation (see the animation above on the right): The slope is change in height divided by change in position, that is, Δf/Δx.

You can visualize the derivative as the slope of a straight line going from the point (x, f(x)) to (x+Δx, f(x+Δx)) as you reduce the horizontal distance Δx, bringing the points closer and closer together. In the mathematical limit Δx→0, the points practically merge. The slope is then quite naturally defined by

f′(x) = lim(Δx→0) Δf/Δx

Just as it is natural to describe a slope by saying how much height changes per distance, it is natural to describe the slope of a function by saying how much its value changes per parameter change.
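The limit definition suggests a simple numerical approximation: pick a small but finite Δx and compute the slope of the connecting line. A minimal sketch (the function name and the choice of Δx are mine):

```python
def numerical_derivative(f, x, dx=1e-6):
    """Approximate f'(x) by the slope of the straight line through
    (x, f(x)) and (x + dx, f(x + dx)) for a small, finite dx."""
    return (f(x + dx) - f(x)) / dx

# For f(x) = x^2 the exact derivative is f'(x) = 2x, so f'(3) = 6:
approx = numerical_derivative(lambda x: x * x, 3.0)
assert abs(approx - 6.0) < 1e-4
```

This finite-difference trick is not how serious frameworks differentiate (they use automatic differentiation, discussed below), but it makes the geometric picture tangible.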

One of the reasons why derivatives are so awfully useful is that at extreme points – minima, the bottoms of your function’s valleys, and maxima, the peaks of your function’s hills – the derivative is always exactly 0. This is directly evident from plotting the slope at the minima and maxima of a function.
And it gets even better! Not only does the derivative tell you whether you are currently at a minimum or maximum, it also tells you in which direction to go to find one! Direction here literally means the direction you move along the x-axis, that is, whether you increase or decrease the parameter x.

At minima and maxima the slope is zero (horizontal), that is, f′=0.

How does that work? As promised in this section’s introduction, the derivative is a tool that allows you at any point, to know what way is up and what way is down. If you want to maximize the function value, that is, you want to find the highest point in your surroundings, then you go upwards.

What does that mean, exactly? Suppose you are currently at parameter value x₀. There, the local derivative is f′(x₀). If the derivative is larger than zero, it means that by increasing x you go up and by decreasing x you go down. So, what do you do to come closer to your objective, the highest point? You increase x. More mathematically speaking, you choose a new parameter value x₁ by taking a step in the direction of positive slope:

x₁ = x₀ + ɛ·f′(x₀)

where ɛ quantifies your step size. In Machine Learning lingo, the step size is called the learning rate.

This formula automatically covers the opposite case, where the derivative is less than zero. Negative f ′ means that for larger x, you go down, farther away from your objective, the highest point. Therefore, at negative f ′(x₀), going upwards means taking a step towards a smaller x-value.

If you want to minimize the function value, you proceed similarly, only that now you take ɛ-steps in the direction of negative slope (downhill):

x₁ = x₀ − ɛ·f′(x₀)

The process is visualized below.

The derivative is used to iteratively find a local maximum (left) or minimum (right).
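The minimization steps described above can be sketched as a short loop. This is a minimal illustration; the function names, step size, and iteration count are my choices, not from the original:

```python
def gradient_descent_1d(fprime, x0, eps=0.1, steps=200):
    """Repeatedly step against the local slope f'(x)
    to approach the nearest local minimum."""
    x = x0
    for _ in range(steps):
        x = x - eps * fprime(x)  # x_1 = x_0 - eps * f'(x_0)
    return x

# Minimize f(x) = (x - 2)^2, whose derivative is f'(x) = 2(x - 2).
# The unique minimum sits at x = 2.
x_min = gradient_descent_1d(lambda x: 2 * (x - 2), x0=10.0)
assert abs(x_min - 2.0) < 1e-6
```

Flipping the sign of the update (x + eps * fprime(x)) turns the same loop into gradient ascent toward a local maximum.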

The astute reader will have noted several problems with this approach. For one, you will only ever find the minimum or maximum closest to your starting point. In the example above, if you by chance started closer to xₘᵢₙ,₂ than to xₘᵢₙ,₁, minimization using the described method will always bring you to xₘᵢₙ,₂ and never to xₘᵢₙ,₁, although it would be an even lower, that is, better, point.
This is a well-known problem stemming from the local nature of the derivative itself, and it’s a hard one. Every method based on derivatives will always have this shortcoming built-in. The problem is even bigger, in particular in higher dimensions, because besides minima and maxima there are also saddle points, other entities with the f ′=0 property, that will interfere with your quest for an optimum.
Unfortunately, to date there is no known method to tell, in general, whether or not there is a better point at all, let alone to systematically find it. A large number of tools have been invented to at least mitigate this problem, but a general solution might still be lifetimes away.
The second issue is the choice of the step size ɛ. If it is too large, you risk overshooting, that is, you might jump over an optimum instead of approaching it. If, on the other hand, it is too small, you might have to take a lot of steps to come close enough to an optimum. Even worse, the actual step size is not ɛ but ɛ·f′ and thus proportional to the derivative’s value. Since the derivative becomes zero at the extremum, it gets smaller and smaller as you approach one, which in turn results in you taking smaller and smaller steps. To combat both this slow-down and the overshooting problem, in practical applications ɛ is chosen adaptively.

An excursion to higher dimensions

But isn’t reality much more complicated, when we have not one parameter but typically hundreds or thousands? No, not really. In fact, it is almost surprisingly easy to generalize the approach described in the previous section from one to arbitrarily many dimensions.

Let us start by going from one to two dimensions. That’s equivalent to going from single-parameter functions f(x) to two-parameter functions f(x,y). Just like the plot of f(x) was two-dimensional (one dimension for input x, one for output f(x)), the plot of f(x,y) is three-dimensional (two inputs + one output). You can visualize it like a landscape and think of f(x,y) as a height-map.

"Landscape" defined by a two-dimensional function f(x,y).

We are trying to do the same thing we did in 1-D before, that is, to find a local optimum. Imagine we are looking for a local minimum. Our logic from the 1-D case still applies perfectly: To approach the next local minimum, one should walk downhill. But for that we would need to know the slope. How do we calculate the slope in 2-D?
Fortunately, the 2-D problem is simply a combination of two 1-D problems. Imagine you paint coordinate lines (lines parallel to the x- and y-axes) onto the surface, as is shown in the plot on the left. Each of these lines corresponds to a vertical cut through the landscape and is itself a one-dimensional function.
For example, the line parallel to the y-axis at x=-0.5 can be seen as the function g(y) = f(-0.5, y) with constant x=-0.5. That means that at any point (x, y), we can slice the 2-D function into two 1-D functions. And we already know how to get the slope of a 1-D function! We simply calculate its derivative.

Vertical cuts through the f(x,y)-landscape at point (-0.5, 0.5), parallel to the x-axis (left), and parallel to the y-axis (right).
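The slicing idea translates directly into code: to get the slope along one axis, hold the other variable fixed and apply the 1-D recipe. A minimal numerical sketch (the names and the example function are mine):

```python
def partial_x(f, x, y, dx=1e-6):
    """Slope of the 1-D slice g(x) = f(x, y) with y held fixed."""
    return (f(x + dx, y) - f(x, y)) / dx

def partial_y(f, x, y, dy=1e-6):
    """Slope of the 1-D slice h(y) = f(x, y) with x held fixed."""
    return (f(x, y + dy) - f(x, y)) / dy

# For f(x, y) = x^2 + 3y the exact partial derivatives are 2x and 3:
f = lambda x, y: x ** 2 + 3 * y
assert abs(partial_x(f, -0.5, 0.5) - (-1.0)) < 1e-4
assert abs(partial_y(f, -0.5, 0.5) - 3.0) < 1e-4
```

Each helper is literally the 1-D finite-difference slope applied to one vertical cut through the landscape.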

But first, we have to refine our notation a bit. Now that there is more than one variable, the notation f′ has become ambiguous, as it is unclear whether the derivative is with respect to x or y. To disambiguate, there is a more general notation. The derivative with respect to x is denoted ∂f/∂x and the derivative with respect to y is ∂f/∂y. The symbol ∂ goes by many names, such as "die", "doh" or – my favorite – "dabba", to name a few. But since it is a stylized letter d, you can simply read it "dee" and the derivative with respect to x as "dee eff dee ex". If you are not used to multivariate calculus, it might be confusing how suddenly a fraction is used to generalize a simple prime (′) symbol. This comes quite naturally, though, because, as we saw in the previous section, the derivative is a fraction: you can interpret the numerator ∂f as the difference in f and the denominator ∂x as the difference in x, in the frame of the limit process.

Having settled on a notation, how do we combine the two 1-D derivatives back to a two-dimensional entity, so that we can tell in which direction to change our parameters (x,y)?

It is one of the great wonders of calculus that, as long as you make small enough steps, you do not even need to combine them at all. If you make a small step along the y-direction exactly as you would in the 1-D case, that is,

y₁ = y₀ − ɛ·∂f/∂y

and another small step in the x-direction, again, exactly as you would in 1-D, that is,

x₁ = x₀ − ɛ·∂f/∂x

you have effectively made a 2-D move in the correct direction, straight toward the minimum. Now what is small enough with regards to step size? Strictly speaking, two 1-D steps are the same as one 2-D step only if the steps are what mathematicians call infinitesimal, that is, infinitely small. However, we are pretty much on the safe side here, as the error from neglecting the interaction of the x- and y-dimensions with finite steps scales benignly with the step size. This means that when you increase the step size, you will feel the errors from the single-parameter derivatives long before the mixed-dimension errors become significant (because the former are of order 𝒪(ɛ) while the latter are of order 𝒪(ɛ²) and ɛ is small).

Of course, there is a better, more compact way to express that you are actually doing the same for each dimension. First, you can define a shorthand for x and y: the vector x⃗ = (x, y). Second, the collection of all 1-D derivatives of a function, the gradient, is denoted by ∇f, which in 2-D is defined as

∇f = (∂f/∂x, ∂f/∂y)

So you can write your two 1-D optimization steps together as one 2-D step:

x⃗₁ = x⃗₀ − ɛ·∇f(x⃗₀)

Now that we have seen how to perfectly decompose a 2-D problem into two 1-D problems, what is going to stop us from going to 3 dimensions, 42 dimensions, or any other conceivable number of dimensions? Nothing will!

Consider any positive integer dimension N. Using again a shorthand for the variables, x⃗ = (x₁, x₂, …, x_N), the slope in N dimensions can be written as the combination of N one-dimensional slopes:

∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂x_N)

Moreover, written this way, the optimization step in N-D looks exactly like in 2-D:

x⃗₁ = x⃗₀ − ɛ·∇f(x⃗₀)

At this point, maybe you think that the mechanics of calculating high-dimensional derivatives is not that difficult after all, but you’re still not quite comfortable with actually thinking in N-dimensional spaces. If so, do not worry, you are in good company! Just do as Geoffrey Hinton, one of the doyens of AI, recommends:

To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it.

Geoffrey Hinton

in his archived Coursera MOOC, Lecture 2c: A geometrical view of perceptrons (around minute 1:00)

To understand everything about gradients you can always resort to thinking of a 3-D landscape like the one from our example, above, and gradient descent is nothing but walking down the hills of this landscape.
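To see that nothing changes conceptually in higher dimensions, here is a small sketch of gradient descent in N dimensions: one 1-D step per coordinate, all taken at once. The example function and all names are illustrative:

```python
def gradient_descent(grad, x0, eps=0.1, steps=500):
    """N-dimensional descent: apply the 1-D update rule
    x_i <- x_i - eps * df/dx_i to every coordinate simultaneously."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - eps * gi for xi, gi in zip(x, g)]
    return x

# Minimize f(x, y, z) = (x - 1)^2 + (y + 2)^2 + z^2,
# whose gradient is (2(x - 1), 2(y + 2), 2z) and minimum is (1, -2, 0).
grad = lambda p: [2 * (p[0] - 1), 2 * (p[1] + 2), 2 * p[2]]
result = gradient_descent(grad, [5.0, 5.0, 5.0])
assert all(abs(a - b) < 1e-6 for a, b in zip(result, [1.0, -2.0, 0.0]))
```

The loop body is identical to the 1-D version; only the bookkeeping over coordinates is new, exactly as the text promises.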

Differentiating some code

Now we have been talking quite a bit about derivatives of mathematical functions in one or many dimensions. But what is the derivative of a piece of computer code? Let us have a look at some concrete programming languages and calculate some actual code derivatives. To follow, you should either be familiar with the derivatives of elementary functions such as x², sin(x), and exp(x) or have a reference at hand.

Consider, for example, this simple function implemented in Python: This function implements the mathematical function f(x) = 3x. In calculus class, it would be the easiest task to calculate its derivative: f′(x) = 3. So what would the best possible implementation of the derivative look like in Python? How about this: (Here, fpr is short for "f prime".) This can justly be called the best possible implementation. It calculates the value of the derivative to machine accuracy. One could even argue that this derivative is actually exact and that only its evaluation at a given x introduces an approximation to machine accuracy.
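The Python listings referred to above are not reproduced here; based on the description in the text, they presumably looked something like this:

```python
def f(x: float) -> float:
    """Implements the mathematical function f(x) = 3x."""
    return 3 * x

def fpr(x: float) -> float:
    """Implements the derivative f'(x) = 3 exactly."""
    return 3
```

Note that fpr is not an approximation scheme: it encodes the derivative itself, evaluated exactly at any x.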

Let us look at another example, this time in Scala: Again, consider the mathematical function implemented here: g(x) = sin(x). The derivative of this important but elementary function is simply g′(x) = cos(x). Therefore, the derivative of our Scala function g(x: Double): Double should be this: Here is one final example, this time in C, to make sure we got this right: We need a function float hpr(float x). What would it look like? Starting again from the actual mathematical function implemented in the code and applying basic differentiation rules, we obtain the derivative, which can readily be implemented in C again! Easy enough: the derivative of code that implements a mathematical function is the implementation of said function’s derivative.
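The Scala and C listings are likewise not reproduced here. Rendered in Python for brevity (the original point being that the principle is language-independent), the sine example and its derivative would read:

```python
import math

def g(x: float) -> float:
    """Implements the mathematical function g(x) = sin(x)."""
    return math.sin(x)

def gpr(x: float) -> float:
    """Implements the derivative g'(x) = cos(x)."""
    return math.cos(x)

# Sanity check: the slope of sin at 0 is cos(0) = 1.
assert gpr(0.0) == 1.0
```

Whatever the language, the pattern is the same: the derivative of code implementing a mathematical function is code implementing that function’s derivative.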

Complicated functions are but chains of simple ones

So far, there is no need for automation. The derivatives in the previous section’s examples could simply be implemented by hand. But how about more complicated functions? Every function, no matter how complicated, is made up of simple operations. To illustrate how to go from simple to complicated, let’s start with a function h(x) that can be seen as the result of the subsequent application of two functions, h(x) = f(g(x)). Creating new functions by inserting the result of one function into another is known as function composition. There is a special notation that saves you from getting lost in all those brackets: h = f ∘ g. Let’s take our composition one step further and plug h(x) into yet another function; the result would be an even longer chain of compositions. Of course, we could continue this game ad infinitum, composing together more and more functions. Interestingly, the fact that a function can be composed of elementary functions allows us to also compose its derivative from the derivatives of these elementary functions. You probably remember this as another concept from high-school calculus, known as the chain rule:

(f ∘ g)′(x) = f′(g(x)) · g′(x)

If g(x) is itself a composition, the chain rule can be applied recursively, to differentiate arbitrarily long composition chains.

Consequently, as long as we know how to differentiate a relatively small set of elementary functions such as sin(x), exp(x), log(x) and xⁿ and basic operations such as sums and products of functions, the chain rule provides us with a simple recipe to compute derivatives of any function, no matter how complicated.
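This recipe can be made concrete in a few lines: given the derivatives of the parts, the chain rule assembles the derivative of the composition. A sketch (the function names are mine):

```python
import math

def compose_derivative(f_prime, g, g_prime):
    """Return the derivative of the composition f(g(x))
    via the chain rule: (f o g)'(x) = f'(g(x)) * g'(x)."""
    return lambda x: f_prime(g(x)) * g_prime(x)

# h(x) = sin(x^2): outer f = sin (f' = cos), inner g(x) = x^2 (g' = 2x),
# so h'(x) = cos(x^2) * 2x.
h_prime = compose_derivative(math.cos, lambda x: x * x, lambda x: 2 * x)
assert abs(h_prime(1.0) - 2 * math.cos(1.0)) < 1e-12
```

Because compose_derivative only needs the parts’ derivatives, it can be applied again to its own output, differentiating arbitrarily deep chains, which is exactly the recursion the text describes.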

You could now use this recipe to calculate the derivative of any function manually, for example on a piece of paper, and then implement the result in code. Mathematically, there’s nothing wrong with that approach. Practically, though, even if you only have a moderately complicated function, for example a product of some dozen factors, each composed of a handful of chained elementary functions, this process becomes tedious and error-prone.
For serious applications, such as the training of a DNN made up of thousands of elementary operations, the manual approach would be ridiculous. This raises the age-old question that always comes up when facing huge amounts of tedious work: Can’t someone else do it?
Yes, the computer can! To have the computer differentiate any differentiable function, we just have to teach it how to differentiate simple operations and the chain rule – and we’re done!

Application of the chain rule is a purely mechanical process that we can easily turn into an algorithm and hand it off to the computer.
This is the idea behind automatic differentiation. It can be realized in various ways, for example using a preprocessor that produces the actual code of the differentiated functions, just like we did manually in our example, above. The more common approach watches your program at runtime and records the sequence (or rather graph) of elementary operations on a so-called tape and then computes the derivative of the taped code, dynamically.
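To give a flavor of how such a system can work under the hood, here is a minimal forward-mode sketch based on so-called dual numbers: each value carries its derivative along, and every elementary operation updates both. This is an illustration of the general idea, not how any particular framework is implemented (reverse mode, as described above, records the operations on a tape instead):

```python
class Dual:
    """A number that carries its value and its derivative with it."""
    def __init__(self, val, dot):
        self.val = val  # function value
        self.dot = dot  # derivative value

    def __add__(self, other):
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def derivative(f, x):
    """Evaluate f at x while propagating the derivative (seed x' = 1)."""
    return f(Dual(x, 1.0)).dot

# f(x) = x*x + x  =>  f'(x) = 2x + 1, so f'(3) = 7:
assert derivative(lambda x: x * x + x, 3.0) == 7.0
```

The user-supplied lambda contains no derivative logic at all; differentiation happens automatically because each elementary operation knows its own local rule, and the chain of operations composes them, exactly as the chain rule prescribes.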

Conclusion

This concludes today’s blog post. We have learned about the concept of Software 2.0 and its currently most important incarnation, differentiable code. After refreshing our high-school calculus knowledge, we have seen how to employ derivatives to iteratively optimize with respect to a scoring function, that is, we have learned the essence of gradient descent. Looking at examples, we have seen that the derivative of code that implements a mathematical function is the implementation of said function’s derivative. Finally, we showed how the chain rule can be used to turn differentiation of arbitrary functions into a purely mechanical process, opening up the way for automatic differentiation. I hope you enjoyed this blog post and maybe even learned something new.
In a follow-up post, we will dive deeper into the algorithmic details of differentiable code and focus on the following points:
• How automatic differentiation (AD) works
• What forward-mode and reverse-mode (backpropagation) AD are