Non-Differentiable Functions in Machine Learning

Try this before you do something fancy.

This post on fretting over non-differentiable functions is mostly relevant if you are building underparametrized models that pivot on making the right assumptions about the data being modeled. I’ve seen this called the “classic” paradigm of machine learning. It is less relevant in the “modern” paradigm of overparametrized deep learning models, which have almost become synonymous with machine learning. Large deep learning models turn out to be relatively forgiving of functions that aren’t differentiable everywhere. Overparametrization has a flattening effect on the loss function, and a complex landscape with many minima is considered a feature and not a bug. The capacity of these models also relaxes the requirement for their structure to match the data-generating process, so the emphasis in designing them rests more on empirical observations and heuristics, and less on trying to express a theory for the specific problem at hand. I would argue that the problems where you are forced to use smaller models are much more common in practice.

Tailoring smaller models to data can mean using awkward shapes like hinges or break points. Alternatively, you may be using a deep learning framework to solve optimization problems, in which case such shapes emerge when constraints are applied. The functions that give trouble in that context are those like $\mathrm{max}(x)$, $\mathrm{abs}(x)$ or $\mathrm{sgn}(x)$, which are very commonplace and expressive, but aren’t differentiable everywhere.

That an undefined derivative gives issues with gradient descent might be expected, but it is actually something that deep learning frameworks like PyTorch can often handle automatically using subgradients. A more significant issue is that functions like this easily generate unpleasant loss functions. For example, even something innocent like $f(x) = |x-a| + |x-b|$, which is $L1$ loss for two datapoints in one dimension, doesn’t have a unique minimum, but is minimal over a contiguous subset $x_{min} \in [a,b]$. Minimizing this using a standard gradient descent framework can give wildly varying values of $x$ each time you train on the exact same training data. For a large number of points, the chances of perfect flatness are lower, but compared to e.g. a least-squares loss, the curvature near the minimum is so flat that training may still be unstable. Adding to that, the 1D example is generous: in higher dimensions intuition fails fast, and instead of a minimum, the flat region could be a plateau that training gets stuck on. In practice it won’t be clear whether the problem is with the model or whether the data is just too noisy.

Figure: Top: 1-Dimensional L1 loss for two points $a=0.25$ and $b=0.75$. Bottom: 1-Dimensional L1 (absolute value) and L2 (least squares) loss for a large number of points sampled from a uniform distribution $x_i\sim\mathbb{U}(0,1)$.

This type of thing is the bread and butter of the (convex) optimization crowd, and to take matters seriously, there is no way around e.g. (Boyd & Vandenberghe, 2004), for which there are also excellent lectures online (70% math, 30% Boyd jokes!). A lot of work goes into this in machine learning as well, both in terms of theory and in terms of implementation (a Google search leads down more than one rabbit hole), but the discussion normally isn’t spelled out as part of the standard curriculum. (Too busy fitting k-means clustering, if the standard introductory courses are anything to go by!) The topic of non-differentiable functions in machine learning also goes deep in the sense that many settings aren’t conducive to straightforward optimization with gradient descent (things like reinforcement learning, GANs, probabilistic models fit with MCMC, non-convex optimization tackled with techniques like genetic algorithms, and so on and on; getting to use gradient methods is a win in and of itself).

The point of this post is to sideline all those deeper discussions. The non-differentiable functions that come up in practice may have smooth approximations that can be gradually sharpened to approach the exact case. An approximate model will usually be equivalent to the exact one for the purpose of fitting data, and it is possible to build intuition by observing the behavior of the model as the discontinuity is removed and replaced. In a case like the introductory example with $L1$ (absolute value) loss, the approximation can re-introduce a little bit of curvature into the loss function, which stabilizes training. As a considerable bonus, removing the non-differentiable point allows for the use of efficient second-order optimization methods like Newton’s method (some software manages this even without the smoothing, but that is another story).
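To make that concrete, here is a minimal sketch (my own, in NumPy) of the introductory example with $|x|$ replaced by $\sqrt{x^2 + \epsilon^2}$, which is just one of several reasonable choices of smoothing:

```python
import numpy as np

def l1_loss(x, points):
    """Exact L1 loss: sum of absolute deviations; flat between the points."""
    return np.sum(np.abs(x - points))

def smooth_l1_loss(x, points, eps=0.1):
    """|x - p| replaced by sqrt((x - p)^2 + eps^2): differentiable everywhere,
    with a little curvature near the minimum. Approaches the exact loss as eps -> 0."""
    return np.sum(np.sqrt((x - points) ** 2 + eps ** 2))

points = np.array([0.25, 0.75])   # the two datapoints a and b from the example
xs = np.linspace(-0.5, 1.5, 201)
exact = np.array([l1_loss(x, points) for x in xs])
smooth = np.array([smooth_l1_loss(x, points) for x in xs])
# The exact loss is flat on [a, b]; the smoothed loss has a unique minimum
# at (a + b) / 2 that gradient descent finds reproducibly.
```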

Below are some approximations of common non-differentiable elements, roughly in the order in which I find them useful. Approximations, in general, are not unique, and other functions can be found that may be more suitable.

It should also be noted that there is a relatively general approach to producing smooth approximations, which is to use mollifiers (though I’ve never actually heard the term “mollifier” used in anger). The basic idea is that the convolution of a function with a Dirac delta function gives you back that original function (think: identity operation), and the convolution of a function with a smooth approximation to the Dirac delta function gives you back a smooth approximation to that function. On $\mathbb{R}^n$, convolution involves integration. The usual, but not the only, way to approximate a Dirac delta function is a Gaussian with variance approaching zero (also listed below). So, if you need something custom-built, and taking the convolution of your function with, for example, a Gaussian sounds like a good idea, then this is a great way to go.

The Heaviside Step Function, $\Theta(x-x_0)$, with Sigmoids

This is the basic building block for any piecewise continuous function.

$$ \begin{equation} \Theta\left(x-x_0\right) = \left\{\begin{array}{ll} 0 & \text{for } x < x_0\\ 0.5 & \text{if } x = x_0\\ 1 & \text{for } x > x_0 \end{array}\right. \end{equation} $$

The step function isn’t always defined to have $\Theta(0) = 0.5$ at the break point “$x_0$”, but that normally won’t make a difference when fitting to fuzzy data. The common way to approximate the step function is to use any kind of sigmoid, including the most famous sigmoid, the logistic function.

$$ \sigma\left(x-x_0; \epsilon\right) = \frac{1}{1+e^{-\frac{\left(x-x_0\right)}{\epsilon}}} $$

Physicists will recognize this function as the Fermi-Dirac distribution, only with the $x$-axis flipped. The breakpoint $x_0$ corresponds to the Fermi energy, and $\epsilon$ is the Boltzmann-scaled temperature $kT$. As $\epsilon$, or, if you wish, the temperature, goes to zero, the approximation becomes exact.

Figure: Heaviside step function $\Theta(x-x_0)$ and approximation with a logistic function $\sigma(x-x_0;\epsilon)$ for $x_0=0$ and different values of $\epsilon$.
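In code, the approximation above is a one-liner; a small NumPy sketch with parameter names of my own choosing:

```python
import numpy as np

def heaviside(x, x0=0.0):
    """Exact step function, with the Theta(x0) = 0.5 convention."""
    return np.heaviside(x - x0, 0.5)

def smooth_step(x, x0=0.0, eps=0.1):
    """Logistic approximation; approaches the exact step as eps -> 0."""
    return 1.0 / (1.0 + np.exp(-(x - x0) / eps))

x = np.linspace(-1.0, 1.0, 5)
print(heaviside(x))             # [0.  0.  0.5 1.  1. ]
print(smooth_step(x, eps=0.1))  # smooth version of the same thing
```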

Step functions are enough to build a lot of useful things. For example:

Top-Hat Function or Square Well, $T(x;a,b)$

$$ \begin{equation} T\left(x; a,b\right) = \left\{\begin{array}{ll} 0 & \text{for } x < a \text{ or } x > b\\ 0.5 & \text{if } x \in \{a,b\}\\ 1 & \text{for } a < x < b \end{array}\right. \end{equation} $$

As one example of how step functions can be used to build useful piecewise things, a top-hat function can be built from two step functions as $T\left(x; a,b\right) = \Theta(x-a) - \Theta(x-b)$. By flipping the sign, you can make a square well for your loss function that lets you effectively perform constrained optimization without reaching for e.g. dedicated convex optimization code.

Figure: Top-hat function $T(x;a,b)$ with boundaries $a=-0.5$ and $b=0.5$, and approximations using logistic functions $\sigma(x-a;\epsilon) - \sigma(x-b;\epsilon)$ for different values of $\epsilon$.
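As a sketch of how this can be used in practice, here is a soft square well as a penalty term in PyTorch (the boundaries, penalty weight, learning rate and $\epsilon$ are all arbitrary choices of mine, not a recipe):

```python
import torch

def soft_well(x, a=-0.5, b=0.5, eps=0.2):
    """Smooth square well: ~0 inside [a, b], ~1 outside.
    Built from two logistic step approximations; sharpens as eps -> 0."""
    step = lambda z: torch.sigmoid(z / eps)
    return 1.0 - (step(x - a) - step(x - b))

# Toy usage: nudge a parameter into [a, b] while minimizing some other loss.
theta = torch.tensor(2.0, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.02)
for _ in range(500):
    opt.zero_grad()
    loss = (theta - 1.0) ** 2 + 10.0 * soft_well(theta)  # penalty weight 10 is arbitrary
    loss.backward()
    opt.step()
# theta ends up inside [a, b] rather than at the unconstrained optimum of 1.0.
# If eps is made very small, the walls become near-vertical, the gradient outside
# the well vanishes, and you are back to the plateau problem from the introduction.
```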

Rectifier or ReLU, $\mathrm{max}(0, x-x_0)$, with $\mathrm{Softplus}$

This function is famous because it is used in neural networks and, indeed, is normally replaced by a smooth approximation because it was giving people trouble.

$$ \begin{equation} \mathrm{ReLU}\left(x-x_0\right) = \mathrm{max}(0, x-x_0)= \left\{\begin{array}{ll} 0 & \text{for } x < x_0\\ x - x_0 & \text{for } x \geq x_0 \end{array}\right. \end{equation} $$

The canonical way to remedy the non-differentiable point is the $\mathrm{Softplus}$ function, which has the form:

$$ \mathrm{Softplus}(x-x_0;\epsilon) = \epsilon\mathrm{ln}\left(1+e^{ \frac{x-x_0}{\epsilon}}\right) $$

But you can also build it with a step function as $\mathrm{ReLU}\left(x-x_0\right) = \left(x-x_0\right)\Theta\left(x-x_0\right)$ and use sigmoids. The two approximations approach the exact case from “above” and “below” respectively.

Figure: ReLU function $\mathrm{ReLU}(x-x_0)$ with $x_0=0$, and approximations using $\mathrm{Softplus}(x-x_0;\epsilon)$ and using a logistic function $(x-x_0)\sigma(x-x_0;\epsilon)$ for different values of $\epsilon$.
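Both approximations are easy to write down in code; a NumPy sketch with $x_0=0$ for brevity, and the Softplus written in its numerically stable form. Incidentally, the second construction, $x\,\sigma(x/\epsilon)$, is essentially the SiLU/Swish activation up to the $1/\epsilon$ scaling.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x, eps=0.1):
    """eps * ln(1 + exp(x / eps)); approaches ReLU from above as eps -> 0.
    Written as max(z, 0) + log1p(exp(-|z|)) to avoid overflow for large x / eps."""
    z = x / eps
    return eps * (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))))

def sigmoid_relu(x, eps=0.1):
    """x * sigma(x / eps); approaches ReLU from below as eps -> 0."""
    return x / (1.0 + np.exp(-x / eps))

x = np.linspace(-1.0, 1.0, 5)
print(relu(x))
print(softplus(x))
print(sigmoid_relu(x))
```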

Vector Minimum, Maximum and Vector Dimension with $\mathrm{SoftMax}$

The $\mathrm{SoftMax}$ function will be familiar as the output layer of neural networks used for classification problems. This is because it maps from $\mathbb{R}^N$ onto the probability simplex (the support of a Dirichlet distribution), which is exactly what you need to describe the probability distribution of an $N$-way categorical event: the output of the $\mathrm{SoftMax}$ is an $N$-dimensional vector $\mathbf{y} = \mathrm{SoftMax}(\mathbf{x})$ with the properties that $y_i\in[0,1]$ for all $y_i\in\mathbf{y}$ and $||\mathbf{y}||_1 = \sum_i y_i = 1$. The built-in exponential mapping also has the perk that it ought to induce the deeper layers to deal in activations on the order of log-probabilities, which is numerically wise. Let $\mathbf{x}$ be an $N$-dimensional vector; then the $i$th output of $\mathrm{SoftMax}$ is normally expressed as:

$$ \begin{equation} \mathrm{SoftMax}\left(\mathbf{x}\right)_i = \frac{e^{x_i}}{\sum_j^N e^{x_j}} \end{equation} $$

I say “normally” because this formulation has numerical issues to do with the exponential functions, which are avoided with the standard trick of subtracting the maximum before exponentiating. Using e.g. a PyTorch built-in function takes that into account. Introducing a parameter $\epsilon$:

$$ \begin{equation} \mathrm{SoftMax}\left(\mathbf{x};\epsilon\right)_i = \frac{e^{\epsilon x_i}}{\sum_j^N e^{\epsilon x_j}} \end{equation} $$

Three interesting limits emerge:

$$ \begin{equation} \begin{array}{l} \lim_{\epsilon\rightarrow0}\mathrm{SoftMax}\left(\epsilon\mathbf{x}\right)_i = \frac{1}{N}\\ \lim_{\epsilon\rightarrow\infty}\mathrm{SoftMax}\left(\epsilon\mathbf{x}\right)_i = \delta\left(x_i - \mathrm{max}(\mathbf{x})\right)\\ \lim_{\epsilon\rightarrow-\infty}\mathrm{SoftMax}\left(\epsilon\mathbf{x}\right)_i = \delta\left(x_i - \mathrm{min}(\mathbf{x})\right)\\ \end{array} \end{equation} $$

At $\epsilon\rightarrow0$, the function simply returns the inverse of the vector dimension. For large negative and positive $\epsilon$, you get a one-hot encoded vector that is $1$ at the location of the minimum and maximum respectively and $0$ elsewhere.

Figure: SoftMax function of some $N$-dimensional input vector $\mathbf{x}$ for different values of $\epsilon$. For $\epsilon=0$, all entries are mapped to the same value of $1/N$. For intermediate values, larger values are amplified and smaller values suppressed. For large $\epsilon$, the entry corresponding to the largest vector element is $1$ and all others are $0$.
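A NumPy sketch of the $\epsilon$-parametrized SoftMax, with the max-subtraction trick mentioned above to keep the exponentials from overflowing:

```python
import numpy as np

def softmax_eps(x, eps=1.0):
    """SoftMax(eps * x), with the maximum subtracted before exponentiating
    so large eps * x doesn't overflow (the shift cancels in the normalization)."""
    z = eps * np.asarray(x, dtype=float)
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([0.1, 0.4, 0.35, 0.15])
print(softmax_eps(x, eps=0.0))     # exactly [0.25 0.25 0.25 0.25] -> 1/N
print(softmax_eps(x, eps=1000.0))  # ~[0 1 0 0], one-hot at the maximum
print(softmax_eps(x, eps=-1000.0)) # ~[1 0 0 0], one-hot at the minimum
```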

Minimum and Maximum Values, $\mathrm{min}(\mathbf{x})$ & $\mathrm{max}(\mathbf{x})$

The one-hot encoding of the minimum and maximum can be used to pick out the actual value from a vector by taking the dot product. The result is tunable continuously and gives back the mean value for $\epsilon=0$ (though there are nicer ways to generate the mean!):

$$ \begin{equation} \begin{array}{l} \lim_{\epsilon\rightarrow0}\mathbf{x}\cdot\mathrm{SoftMax}\left(\epsilon\mathbf{x}\right) = \mathrm{mean}(\mathbf{x})\\ \lim_{\epsilon\rightarrow\infty}\mathbf{x}\cdot\mathrm{SoftMax}\left(\epsilon\mathbf{x}\right) = \mathrm{max}(\mathbf{x})\\ \lim_{\epsilon\rightarrow-\infty}\mathbf{x}\cdot\mathrm{SoftMax}\left(\epsilon\mathbf{x}\right) = \mathrm{min}(\mathbf{x})\\ \end{array} \end{equation} $$

Figure: Result of $\mathbf{x}\cdot\mathrm{SoftMax}(\epsilon\mathbf{x})$ for different values of $\epsilon$. The function lets you tune between $\mathrm{min}(\mathbf{x})$ and $\mathrm{max}(\mathbf{x})$, passing through $\mathrm{mean}(\mathbf{x})$ at $\epsilon=0$. Here $\mathbf{x}\in\mathbb{R}^{100}$ was generated from a uniform distribution $\mathbf{x}\sim\mathbb{U}(0,1)$.
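The dot product construction in code (the small SoftMax helper is repeated so the snippet runs on its own):

```python
import numpy as np

def softmax_eps(x, eps=1.0):
    z = eps * np.asarray(x, dtype=float)
    e = np.exp(z - np.max(z))  # max-subtraction for numerical stability
    return e / np.sum(e)

def soft_extremum(x, eps):
    """x . SoftMax(eps * x): mean at eps = 0, -> max(x) for eps >> 0, -> min(x) for eps << 0."""
    x = np.asarray(x, dtype=float)
    return np.dot(x, softmax_eps(x, eps))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
print(soft_extremum(x, 0.0), np.mean(x))      # identical
print(soft_extremum(x, 1000.0), np.max(x))    # approximately equal
print(soft_extremum(x, -1000.0), np.min(x))   # approximately equal
```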

If-Elif-Else

The softmax can also be used to build if-elif-else conditionals. The if/elif statements require a set of score functions $f_i(x)$ and a constant $c$ such that, when condition $i$ is met, $f_i(x) > f_j(x)$ for all $j\neq i$ and $f_i(x) > c$, and $f_i(x) < c$ when condition $i$ is not met. The branch outputs are then blended with the weights $\mathrm{SoftMax}\left(\epsilon\left[f_1(x),\dots,f_k(x),c\right]\right)$, with the constant $c$ playing the role of the else branch, so that the active branch dominates as $\epsilon$ grows. The functions $f_i(x)$ themselves also need to be differentiable. You can often also build these conditionals by using step functions.
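Here is a sketch of one way to wire that up (the scores, branches and $\epsilon$ below are my own toy construction, not a canonical recipe):

```python
import numpy as np

def soft_select(scores, values, eps=10.0):
    """Differentiable if/elif/else: weight each branch value by a SoftMax over
    the branch scores. The branch with the largest score wins outright as eps
    grows; for finite eps the hand-over between branches is smooth."""
    z = eps * np.asarray(scores, dtype=float)
    w = np.exp(z - np.max(z))
    w = w / np.sum(w)
    return np.dot(w, np.asarray(values, dtype=float))

# A soft version of:
#   if x < -0.5:  y = -x     (score: how far below -0.5 we are)
#   elif x > 0.5: y = x      (score: how far above 0.5 we are)
#   else:         y = x**2   (constant score c = 0 acts as the else branch)
def soft_piecewise(x, eps=10.0):
    scores = [-0.5 - x, x - 0.5, 0.0]
    values = [-x, x, x ** 2]
    return soft_select(scores, values, eps)

print(soft_piecewise(-2.0), soft_piecewise(0.0), soft_piecewise(2.0))  # ~2.0, ~0.0, ~2.0
```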

Top-K Error

Until I find time to write this out, see this paper. It’s more elegant than what I came up with.

Dirac Delta Function, $\delta(x-x_0)$, with a Gaussian

The Dirac delta function is often described as a function that has a single infinite spike at the origin, but that simplicity is slightly misleading, because it is not a function in the traditional sense and behaves a little differently. It is defined meaningfully only in the context of integration:

$$ \begin{equation} \int_A f(x)\delta(x-x_0) dx = f(x_0) \end{equation} $$

where $x_0\in A$. Because of this, you are probably only going to encounter it in the context of doing analytical calculations, because trying to handle infinitely thin, infinitely sharp spikes numerically is a bad idea. As mentioned at the outset, a smooth approximation to the Dirac delta function can be used to create a smooth approximation to a function by calculating the convolution with it. The most common, but not the only, way to create a smooth approximation to the Dirac delta function is:

$$ \begin{equation} \delta_\epsilon(x) = \frac{1}{|\epsilon|\sqrt{\pi}}e^{-\left(\frac{x}{\epsilon}\right)^2} \end{equation} $$

To create a smooth approximation to a function $f(x)$, calculate:

$$ \begin{equation} f_\epsilon(x) = \int f(x')\delta_\epsilon(x'-x) dx' \end{equation} $$

Figure: Smooth approximation to a Dirac delta function using a Gaussian $\delta_\epsilon(x) = \frac{1}{|\epsilon|\sqrt{\pi}}e^{-\left(\frac{x}{\epsilon}\right)^2}$ for different values of $\epsilon$.
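For completeness, a sketch of doing the mollification numerically for $f(x) = |x|$, with the convolution integral approximated by a simple sum on a grid (grid width and resolution are arbitrary choices):

```python
import numpy as np

def mollify(f, x, eps=0.1, half_width=5.0, n=2001):
    """f_eps(x) = int f(x') delta_eps(x' - x) dx', with delta_eps a Gaussian of
    width eps, approximated by a Riemann sum on a grid around x."""
    xp = np.linspace(x - half_width * eps, x + half_width * eps, n)
    kernel = np.exp(-((xp - x) / eps) ** 2) / (abs(eps) * np.sqrt(np.pi))
    return np.sum(f(xp) * kernel) * (xp[1] - xp[0])

f = np.abs                       # the non-differentiable function to smooth
print(f(0.0), mollify(f, 0.0))   # 0.0 vs eps / sqrt(pi) ~ 0.056: the kink is rounded off
print(f(1.0), mollify(f, 1.0))   # away from the kink the two agree
```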
  1. Boyd, S. P., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.