The conceptual framework behind gradient descent, backpropagation, and Software 2.0.
Differentiable code makes the Machine Learning world go round. PyTorch, TensorFlow, and every other modern Machine Learning framework rely on it when it comes to the actual learning. Reason enough for anybody even remotely interested in AI to be acquainted with differentiable code. We invite you to follow us through a two-part explanation of the conceptual framework that powers a large part of today’s AI industry. No Ph.D. required!
Differentiable code is vital to Machine Learning because it is the basis of gradient descent, a family of optimization algorithms doing the grunt work in the training of Machine Learning models. The adjective differentiable comes from the noun differentiation, the mathematical process of computing derivatives. Yes, the good old derivatives from your high-school calculus class. Indeed, you get amazingly far with high-school math when it comes to AI. For example, as we will see below, the gradient in the name gradient descent is essentially a collection of derivatives.
To train even a moderately sized deep neural network (DNN) using gradient descent, you have to compute a lot of derivatives. That’s why training efficiency depends on the ability to differentiate efficiently. Enter automatic differentiation (AD). AD is how computers calculate derivatives. AD, and in particular its so-called reverse mode (aka backpropagation), is the algorithmic driver behind the current AI gold rush. Without it, serious Machine Learning applications would not be feasible, let alone profitable, even on modern-day cloud infrastructure. In fact, if you see a deep neural network running anywhere, it has almost certainly been trained using reverse-mode AD.
Today’s post will answer the following questions:
- What is Software 2.0 and why might the idea of differentiable code be even bigger than Machine Learning?
- How does my high-school calculus class relate to gradient descent, the workhorse of Machine Learning?
- What does it mean to differentiate code? How can one possibly compute a derivative of a piece of, say, Python, Scala, or C code?
- Why can and should differentiation be automated?
Software 2.0, or Fill-in-the-Blanks Programming
With Machine Learning becoming the norm in the late 2010s, AI celebrities such as Yann LeCun or Andrej Karpathy popularized the idea of Software 2.0: If neural networks, a special kind of computer program, can learn their optimal parameters based on sufficient training data, then why don’t we design all programs to optimize themselves based on data describing their correct behavior?
To illustrate this idea, imagine the space of all possible programs – the vast program space, where every possible computer program is represented by a single point. Software 2.0 is when you do not fix your entire program but rather select a subset of program space – a parameterizable class of possible programs – and leave it to the computer to find the optimal choice within this class. This is what Andrej Karpathy calls “fill-in-the-blanks programming”. You define the frame, Software 2.0 paints the picture.
The program from the tester’s viewpoint.
Cutting the black box open reveals its internal parameters.
A Software 2.0 program subject to N soft unit tests tunes its internal parameters to match the output f(xᵢ) to the expectation ŷᵢ.
After a relatively short-lived hype, the term Software 2.0 didn’t become quite the buzzword its inventors had hoped for. However, if you have been anywhere near the Machine Learning crowd in the last decade, you likely have heard of the most relevant subclass of Software 2.0: differentiable code, sometimes written ∂Code (“∂” is the symbol of partial differentiation).
A first differentiable program
(The idea of this section has been adapted from Differentiable Functional Programming, N. Welsh, 2018)
Imagine you had to write a fairly simple program. The requirements are as follows: The program has to accept a single floating-point number as an input parameter x. From this input, it has to produce an output y, another float. You are given 200 soft unit tests describing the required behavior. Each test is defined by a pair (x, ŷ) of one input x and one expected output ŷ.
Since the unit tests are soft, there has to be a measure of how well or badly the program did on each test. For this purpose, we introduce the scoring metric

𝓛⁽ⁱ⁾ = ( f(x⁽ⁱ⁾) − ŷ⁽ⁱ⁾ )²

The metric 𝓛⁽ⁱ⁾ is the so-called loss of the i-th test. Optimally, it would be zero, because then the output f(x⁽ⁱ⁾) would exactly match the expected output ŷ⁽ⁱ⁾. The overall test suite performance metric 𝓛 of our program is then simply the sum of all 200 𝓛⁽ⁱ⁾, or, to put it as a mathematician’s for-loop,

𝓛 = ∑ᵢ₌₁²⁰⁰ 𝓛⁽ⁱ⁾
The smaller 𝓛, the better our overall performance. Since the tests are pairs of numbers, we can visualize all 200 in one 2-D plot, with the inputs on the x-axis and the outputs on the y-axis. Let’s assume the plot of your test data looks like the picture below on the left.
A plot of 200 soft unit tests. Each point represents a pair of input and expected output.
Test suite performance for parameter values ω=0.75, 2.25, 1.0, and 1.5. Left: the program’s outputs (solid line) vs. the expected outputs (dots). In one frame, three orange bars on the left indicate the point-wise error ℒ⁽ⁱ⁾ for three example points. Right: 𝓛-score for each ω-value.
The program’s overall test suite performance (loss score) as function of frequency parameter ω.
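The whole setup can be sketched in a few lines of Python. Everything here is a hypothetical reconstruction: we assume, as the frequency parameter ω suggests, that the candidate program class is f(x) = sin(ω·x), and we generate illustrative test data ourselves rather than using the original 200 points:

```python
import math
import random

# Hypothetical test data: 200 soft unit tests (x, y_expected), generated
# around an assumed "true" frequency. This stands in for the original data.
random.seed(0)
TRUE_W = 1.5
tests = [(x, math.sin(TRUE_W * x) + random.gauss(0, 0.05))
         for x in [random.uniform(0.0, 6.0) for _ in range(200)]]

def f(x, w):
    # The parameterizable program class: all sine waves of frequency w.
    return math.sin(w * x)

def loss(w):
    # Overall test suite performance: sum of the point-wise losses L_i.
    return sum((f(x, y_exp_and_x := y) - y) ** 2 if False else (f(x, w) - y) ** 2
               for x, y in tests)

# Evaluate the loss landscape at the parameter values from the figure.
for w in (0.75, 1.0, 1.5, 2.25):
    print(f"w = {w:.2f}  ->  loss = {loss(w):.2f}")
```

The smaller the printed loss, the better that choice of ω explains the test data; “filling in the blank” means finding the ω with the smallest loss.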
The derivative is the slope of a line connecting two points in the limit of vanishing distance between said points.
To see how the derivative relates to the slope, let us take a look at its definition. Commonly, the derivative is written f ′ (read: “eff prime”) and defined as a mathematical limit of the form

f ′(x) = lim (Δx→0) [ f(x+Δx) − f(x) ] / Δx
This expression has a beautiful geometrical interpretation (see the animation above on the right):
The slope is change in height divided by change in position, that is, Δf/Δx.
Just as it is natural to describe a slope by saying how much height changes per distance, it is natural to describe the slope of a function by saying how much its value changes per parameter change.
In minima and maxima the slope is zero (horizontal), that is, f′=0.
What does that mean now, exactly? Suppose you are currently at parameter value x₀. There, the local derivative is f ′(x₀). If the derivative is larger than zero, it means that by increasing x you go up and by decreasing x you go down. So, what do you do to come closer to your objective, the highest point? You increase x. More mathematically speaking, you choose a new parameter value x by taking a step in the direction of positive slope:

x = x₀ + ɛ·f ′(x₀)
where ɛ quantifies your step size. In Machine Learning lingo, the step size is called the learning rate.
This formula automatically covers the opposite case, where the derivative is less than zero. Negative f ′ means that for larger x, you go down, farther away from your objective, the highest point. Therefore, at negative f ′(x₀), going upwards means taking a step towards a smaller x-value.
If you want to minimize the function value, you proceed similarly, only that now you take ɛ-steps in the direction of negative slope (downhill):

x = x₀ − ɛ·f ′(x₀)
The process is visualized below.
The derivative is used to iteratively find a local maximum (left) or minimum (right).
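In code, this iteration is just a loop. Here is a minimal sketch in Python; the function, its starting point, and the learning rate are illustrative choices, not taken from the figures:

```python
def f(x):
    return (x - 2.0) ** 2      # example function with its minimum at x = 2

def fpr(x):
    return 2.0 * (x - 2.0)     # its derivative

x = 0.0                        # arbitrary starting point
eps = 0.1                      # learning rate (step size)
for _ in range(100):
    x = x - eps * fpr(x)       # step in the direction of negative slope

print(x)                       # close to 2.0, the minimum
```

Flipping the minus to a plus turns this descent into the ascent variant that climbs toward a maximum.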
An excursion to higher dimensions
But isn’t reality much more complicated when we have not one parameter but typically hundreds or thousands? No, not really. In fact, it is almost surprisingly easy to generalize the approach described in the previous section from one to arbitrarily many dimensions.
„Landscape“ defined by a two-dimensional function f(x,y).
Vertical cuts through the f(x,y)-landscape at point (-0.5, 0.5), parallel to the x-axis (left), and parallel to the y-axis (right).
But first, we have to refine our notation a bit. Now that there is more than one variable, the notation f ′ has become ambiguous, as it is unclear whether the derivative is taken with respect to x or y. To disambiguate, there is a more general notation. The derivative with respect to x is denoted

∂f/∂x

and the derivative with respect to y is

∂f/∂y
The symbol ∂ goes by many names, such as “die”, “doh” or, my favorite, “dabba”, to name a few. But since it is a stylized letter d, you can simply read it “dee” and the derivative with respect to x as “dee eff dee ex”. If you are not used to multivariate calculus, it might be confusing how suddenly a fraction is used to generalize a simple prime (′) symbol. This comes quite naturally, though, because, as we saw in the previous section, the derivative is a fraction:

∂f/∂x = lim (Δx→0) Δf/Δx
You can interpret the numerator ∂f as the difference in f and the denominator ∂x as the difference in x, in the frame of the limit process.
Having settled on a notation, how do we combine the two 1-D derivatives back to a two-dimensional entity, so that we can tell in which direction to change our parameters (x,y)?
It is one of the great wonders of calculus that, as long as you make small enough steps, you do not even need to combine them at all. If you make a small step along the y-direction exactly as you would in the 1-D case, that is,

y = y₀ − ɛ·∂f/∂y(x₀, y₀)
and another small step in the x-direction, again, exactly as you would in 1-D, that is,

x = x₀ − ɛ·∂f/∂x(x₀, y₀)
you have effectively made a 2-D move in the correct direction, straight toward the minimum. Now, what is small enough with regard to step size? Strictly speaking, two 1-D steps are the same as one 2-D step only if the steps are what mathematicians call infinitesimal, that is, infinitely small. However, we are pretty much on the safe side here, as the error from neglecting the interaction of the x- and y-dimensions with finite steps scales benignly with the step size. This means that when you increase the step size, you will feel the errors from the single-parameter derivatives long before the mixed-dimension errors become significant (because the former are of order 𝒪(ɛ) while the latter are of order 𝒪(ɛ²) and ɛ is small).
Of course, there is a better, more compact way to express that you are actually doing the same for each dimension. First, you can define a shorthand for x and y:

𝐱 = (x, y)
Second, the collection of all 1-D derivatives of a function, the gradient, is denoted by ∇f, which in 2-D is defined as

∇f = ( ∂f/∂x , ∂f/∂y )
So you can write your two 1-D optimization steps together as one 2-D step:

𝐱 = 𝐱₀ − ɛ·∇f(𝐱₀)
Now that we have seen how to perfectly decompose a 2-D problem into two 1-D problems, what is going to stop us from going to 3 dimensions, 42 dimensions, or any other conceivable number of dimensions? Nothing will!
Consider any positive integer dimension N. Using again a shorthand for the variables,

𝐱 = (x₁, x₂, …, x_N)
the slope in N dimensions can be written as the combination of N one-dimensional slopes:

∇f = ( ∂f/∂x₁ , ∂f/∂x₂ , … , ∂f/∂x_N )
Moreover, written this way, the optimization step in N-D looks exactly like in 2-D:

𝐱 = 𝐱₀ − ɛ·∇f(𝐱₀)
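The same loop as in the 1-D case works in any dimension; only the scalars become vectors. A minimal sketch for N = 3, where the quadratic test function and its target are illustrative choices:

```python
# Minimize f(x) = sum_i (x_i - t_i)^2 for an arbitrary target vector t.
# The unique minimum is at x = target.
target = [1.0, -2.0, 3.0]

def grad(x):
    # The gradient: the vector of all N one-dimensional derivatives.
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

x = [0.0, 0.0, 0.0]            # starting point
eps = 0.1                      # learning rate
for _ in range(200):
    g = grad(x)
    x = [xi - eps * gi for xi, gi in zip(x, g)]   # x = x0 - eps * grad f(x0)

print(x)                       # close to the target vector
```

Nothing in this loop depends on N being 3; the very same code handles 42 or a million dimensions.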
At this point, maybe you think that the mechanics of calculating high-dimensional derivatives is not that difficult after all, but you’re still not quite comfortable with actually thinking in N-dimensional spaces. If so, do not worry, you are in good company! Just do as Geoffrey Hinton, one of the doyens of AI, recommends:
To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say “fourteen” to yourself very loudly. Everyone does it.
Differentiating some code
Now we have been talking quite a bit about derivatives of mathematical functions in one or many dimensions. But what is the derivative of a piece of computer code? Let us have a look at some concrete programming languages and calculate some actual code derivatives. To follow, you should either be familiar with the derivatives of elementary functions such as x², sin(x), and exp(x) or have a reference at hand.
Consider, for example, this simple function implemented in Python:
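It might look like this (the exact form, names, and type hints are assumed):

```python
def f(x: float) -> float:
    # The mathematical function f(x) = 3x
    return 3.0 * x
```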
This function implements the mathematical function f(x) = 3x. In calculus class, it would be the easiest task to calculate its derivative: f ′(x) = 3. So how would the best possible implementation of the derivative look in Python? How about this:
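A natural candidate (name fpr as used below, exact form assumed):

```python
def fpr(x: float) -> float:
    # The derivative f'(x) = 3 is constant: it ignores its input.
    return 3.0
```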
(Here, fpr is short for “f prime”.) This can be justly called the best possible implementation. It calculates the value of the derivative to machine accuracy. One could even argue that this derivative is actually exact and that only its evaluation at a given x introduces an approximation to machine accuracy.
Let us look at another example, this time in Scala:
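The function in question (signature as referenced below, body assumed) might be:

```scala
def g(x: Double): Double = math.sin(x)
```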
Again, consider the mathematical function implemented here: g(x) = sin(x). The derivative of this important but elementary function is simply g ′(x) = cos(x). Therefore, the derivative of our Scala function g(x: Double): Double should be this:
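With the name gpr assumed by analogy with fpr:

```scala
def gpr(x: Double): Double = math.cos(x)
```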
Here is one final example, this time in C, to make sure we got this right:
We need a function float hpr(float x). What would it look like? Starting again from the actual mathematical function implemented in the code
and applying basic differentiation rules, we obtain
That can be readily implemented in C again!
Easy enough: the derivative of code that implements a mathematical function is the implementation of said function’s derivative.
Complicated functions are but chains of simple ones
So far, there is no need for automation. The derivatives in the previous section’s examples could simply be implemented by hand. But how about more complicated functions? Every function, no matter how complicated, is made up of simple operations. To illustrate how to go from simple to complicated, let’s start with
The function h(x) can be seen as the result of subsequent application of two functions
as can be seen here:
Creating new functions by inserting the result of one function into another is known as function composition. There is a special notation that saves you from getting lost in all those brackets: for two functions f and g, one writes

(f ∘ g)(x) = f(g(x))
Let’s take our composition one step further and plug h(x) into yet another function
The result would be
Of course, we could continue this game ad infinitum, composing together more and more functions.
Interestingly, the fact that a function can be composed of elementary functions allows us to also compose its derivative from the derivatives of these elementary functions. You probably remember this as another concept from high-school calculus, known as the chain rule:

( f(g(x)) )′ = f ′(g(x)) · g ′(x)
If g(x) is itself a composition, the chain rule can be applied recursively, to differentiate arbitrarily long composition chains.
Consequently, as long as we know how to differentiate a relatively small set of elementary functions such as sin(x), exp(x), log(x) and xⁿ and basic operations such as sums and products of functions, the chain rule provides us with a simple recipe to compute derivatives of any function, no matter how complicated.
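As a concrete, made-up example of that recipe: take the composition h(x) = sin(x²). The chain rule gives h′(x) = cos(x²)·2x, which we can check against a finite-difference approximation:

```python
import math

def h(x):
    return math.sin(x ** 2)            # composition: outer sin, inner x^2

def dh(x):
    # Chain rule: (sin(g(x)))' = cos(g(x)) * g'(x) with g(x) = x^2
    return math.cos(x ** 2) * 2.0 * x

# Sanity check via a central finite difference
x, dx = 1.3, 1e-6
numeric = (h(x + dx) - h(x - dx)) / (2.0 * dx)
print(abs(dh(x) - numeric))            # prints a very small number
```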
You could now use this recipe to calculate the derivative of any function manually, for example on a piece of paper, and then implement the result in code. Mathematically, there’s nothing wrong with that approach. Practically, though, even if you only have a moderately complicated function, for example a product of some dozen factors, each composed of a handful of chained elementary functions, this process becomes tedious and error-prone.
For serious applications, such as the training of a DNN made up of thousands of elementary operations, the manual approach would be ridiculous. This begs the age-old question that always comes up when facing huge amounts of tedious work: Can’t someone else do it?
Yes, the computer can! To have the computer differentiate any differentiable function, we just have to teach it how to differentiate simple operations and the chain rule – and we’re done!
Application of the chain rule is a purely mechanical process that we can easily turn into an algorithm and hand it off to the computer.
This is the idea behind automatic differentiation. It can be realized in various ways, for example using a preprocessor that produces the actual code of the differentiated functions, just like we did manually in our example above. The more common approach watches your program at runtime, records the sequence (or rather, graph) of elementary operations on a so-called tape, and then computes the derivative of the taped code dynamically.
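To see how little machinery this takes, here is a toy sketch of one flavor of AD (forward mode, via so-called dual numbers, to which we return in part two). Everything here is illustrative and not any particular framework’s API: we teach the computer the product rule and the chain rule for sin, and derivatives of compositions come out automatically:

```python
import math

class Dual:
    """A value bundled with its derivative. Overloaded arithmetic
    applies the differentiation rules mechanically."""
    def __init__(self, val, der):
        self.val, self.der = val, der

    def __add__(self, other):
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.der + other.der)

    def __mul__(self, other):
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

def sin(d):
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(d.val), math.cos(d.val) * d.der)

# Differentiate f(x) = x * sin(x) at x = 2 without ever writing f' by hand:
x = Dual(2.0, 1.0)        # the derivative of x with respect to itself is 1
y = x * sin(x)
print(y.val, y.der)       # f(2) and f'(2) = sin(2) + 2*cos(2)
```

Each elementary operation only knows its own derivative rule, yet chaining them through ordinary-looking code yields the exact derivative of the whole composition.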
The upcoming second part will cover the following topics:
- How automatic differentiation (AD) works
- What forward-mode and reverse-mode (backpropagation) AD are
- Why gradient descent needs reverse-mode AD
- How to compute derivatives of for-loops, conditionals, and other control structures
- Applications of differentiable code beyond classical Machine Learning
In the meantime, you might want to check out the following excellent presentations for further insights into the world of differentiable code:
- Ryan P. Adams, You Should Be Using Automatic Differentiation, 2016
- Matthew J. Johnson, Automatic Differentiation, Deep Learning Summer School Montreal 2017
- Noel Welsh, Differentiable Functional Programming, Scala Days 2018
- Viral B. Shah, Differentiable Programming — An exciting generalization of Deep neural networks, PyData Delhi 2019