Breaking the Gradient: Supervised Learning with Non-Differentiable Loss Functions
Zian (Andy) Wang
Supervised learning is a fundamental machine learning task in which a model is trained to make predictions from labeled data. The process typically starts by feeding data into the model to generate predictions. These predictions are then compared against the ground truth through a function that quantifies the prediction error, known as the loss (or objective) function. To improve the model's performance, we update its parameters: we take the gradient of the loss function with respect to the network's parameters, which tells us the direction of parameter update that will "nudge" the model towards a minimum of the loss. This cycle of prediction, evaluation, and update is repeated until the model converges. Note that this description applies mainly to Deep Learning approaches, since not all classical machine learning models follow the same training scheme.
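The predict, evaluate, update cycle can be sketched in a few lines. This is a minimal illustration rather than anything from a specific framework: the one-parameter-pair linear model, the mean squared error loss, the learning rate, and the name `train_linear_model` are all assumptions chosen to keep the example small.

```python
import numpy as np

def train_linear_model(x, y, lr=0.1, epochs=200):
    """Fit y ~= w * x + b by repeating predict -> evaluate -> update."""
    w, b = 0.0, 0.0
    loss = float("inf")
    for _ in range(epochs):
        pred = w * x + b                    # 1. prediction
        error = pred - y
        loss = np.mean(error ** 2)          # 2. evaluation (mean squared error)
        grad_w = 2.0 * np.mean(error * x)   # 3. gradient of the loss w.r.t. w
        grad_b = 2.0 * np.mean(error)       #    and w.r.t. b
        w -= lr * grad_w                    # 4. step against the gradient
        b -= lr * grad_b
    return w, b, loss

# Toy data generated from y = 3x + 0.5 (hypothetical, for illustration only).
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 0.5
w, b, final_loss = train_linear_model(x, y)
```

Everything in step 3 depends on the loss being differentiable; the rest of this article is about what to do when it is not.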
Aside from the model, the next crucial component in the training process is the loss function. In supervised learning, the loss function assumes the role of an "optimizee." At the same time, the optimizer, an algorithm outlining how parameter updates are executed, optimizes the loss function through its computed gradients. The gradient of the loss function can be seen as a tool that "sculpts" out the loss landscape; without it, the model would be no better off than a blind man traveling tens of thousands of mountain ranges because the model cannot even determine which way is up or down!
The chosen loss function must therefore be differentiable and accurately reflect the model's performance. However, in the ever-so-intricate world of today, not every problem can be described and optimized by a differentiable function, and not every carefully crafted objective function is differentiable.
The Problem With Most Loss Functions
To illustrate, in biomedical image segmentation, the usual Dice loss and Jaccard Index (IoU) might not suffice. The model is often required to identify small objects in a complex medical image, which can span hundreds or even thousands of pixels in each dimension. Measuring error only by the amount of overlap between the predicted mask and the ground truth, as the Dice loss does, can be problematic: when the two do not overlap at all, the loss gives the model no feedback about how close its prediction is to the location of the ground truth. In other cases, the loss function is genuinely non-differentiable, for instance when the model's predictions are an intermediate stage in a pipeline that generates labels. For example, the model may predict specific characteristics of a 3D structure, but the actual labels can only be generated once the model's predictions are fed into third-party rendering software. In such cases, the loss cannot be optimized directly, as the network cannot "back propagate" through the third-party software.
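The overlap problem is easy to reproduce. The sketch below computes the Dice loss on hard binary masks; the 64x64 image size, the 4x4 squares, and the helper names `dice_loss` and `square_mask` are assumptions made up for this illustration.

```python
import numpy as np

def dice_loss(pred_mask, true_mask):
    """1 - Dice coefficient for two binary masks."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    total = pred_mask.sum() + true_mask.sum()
    return 1.0 - 2.0 * intersection / total

def square_mask(size, top, left, side=4):
    """Boolean image containing a single filled square."""
    mask = np.zeros((size, size), dtype=bool)
    mask[top:top + side, left:left + side] = True
    return mask

truth = square_mask(64, 10, 10)
near_miss = square_mask(64, 10, 14)   # touches the target but does not overlap it
far_miss = square_mask(64, 50, 50)    # on the other side of the image
partial = square_mask(64, 10, 12)     # overlaps half of the target

near_loss = dice_loss(near_miss, truth)
far_loss = dice_loss(far_miss, truth)
partial_loss = dice_loss(partial, truth)
```

Both misses receive exactly the same loss even though one is right next to the target, so the loss alone cannot tell the model which of the two predictions is closer.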
In truth, there are many more real-world situations in which nothing strictly prevents engineers from crafting a differentiable loss function; rather, the differentiable function does not convey enough information for the model to improve and converge.
Genetic and Evolutionary Based Algorithms
Turning to gradient-free optimization techniques is one straightforward way to train Deep Learning models with non-differentiable loss functions. Because traditional evolutionary and genetic algorithms involve no backpropagation or gradient computation, they can train networks whose loss functions are non-differentiable. These optimization techniques typically treat the loss function or optimization objective as a "black box," and they "tune" the model based on the feedback they get from the "black box environment." Many of them are heavily inspired by the evolutionary process of nature or by the behavior of specific organisms. Particle Swarm Optimization, Genetic Algorithms, and Ant Colony Optimization are a few of the popular options. These algorithms rely on fascinatingly simple rules to adjust parameters iteratively and eventually reach convergence. Although discussing their inner workings is beyond this article's scope, most evolutionary/genetic algorithms can be adapted to train neural networks. One commonly cited advantage of gradient-free training schemes is that they tend to be less likely to become "stuck" in local minima than conventional gradient-based optimizers, because they modify the model based on environmental feedback rather than by following the local slope of a function.
However, in some cases, the drawbacks of these algorithms outweigh their advantages. Notably, evolutionary algorithms typically take far longer to train than standard backpropagation. This is because their training mechanisms rely on "trial and error" rather than on computing the direction of improvement mathematically. Another caveat is that these approaches usually don't come with out-of-the-box implementations that work with neural networks, nor with ready-to-use libraries that can utilize GPUs to speed up training.
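To make the black-box idea concrete, here is a minimal sketch of a simple elitist evolution strategy, one of the most basic members of this family and not one of the specific algorithms named above. It tunes a linear classifier against the zero-one loss, which is piecewise constant and therefore useless to a gradient-based optimizer. The toy data, the population size, the mutation scale, and the function names are all assumptions for illustration.

```python
import numpy as np

def zero_one_loss(params, X, labels):
    """Fraction of misclassified points; its gradient is zero almost everywhere."""
    scores = X @ params[:2] + params[2]
    return np.mean((scores > 0) != labels)

def evolve(X, labels, generations=200, children=20, sigma=0.5, seed=0):
    """Keep one parent; each generation, try random mutations and keep the best."""
    rng = np.random.default_rng(seed)
    parent = np.zeros(3)
    parent_loss = zero_one_loss(parent, X, labels)
    for _ in range(generations):
        # Trial: perturb the parent with Gaussian noise to create candidates.
        candidates = parent + sigma * rng.normal(size=(children, 3))
        # Feedback: query the black box once per candidate.
        losses = np.array([zero_one_loss(c, X, labels) for c in candidates])
        best = np.argmin(losses)
        # Selection: the best child replaces the parent if it is no worse.
        if losses[best] <= parent_loss:
            parent, parent_loss = candidates[best], losses[best]
    return parent, parent_loss

# Toy data: points are positive when x0 + x1 > 0 (hypothetical rule).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
labels = (X[:, 0] + X[:, 1]) > 0
initial_loss = zero_one_loss(np.zeros(3), X, labels)
params, final_loss = evolve(X, labels)
```

Note the cost: with these settings the loop makes 4,000 loss evaluations to tune three parameters, which hints at why such methods scale poorly to networks with millions of weights.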
Suppose you're not interested in writing these algorithms from scratch and speeding them up manually with low-level GPU code. In that case, you might look for methods that can be implemented in existing, popular Deep Learning frameworks. We can then return to the central concept used by evolutionary algorithms, treating the loss function as a black box environment, but "modernize" it with Deep Learning. To those familiar with it, this sounds a lot like Reinforcement Learning. Specifically, we can try to adapt the Actor-Critic method to supervised learning.
Adapting Actor-Critic to Supervised Learning
Reinforcement learning is a subfield of machine learning where an agent is trained to interact with an environment by performing actions based on policies to maximize a reward signal. The agent learns by adjusting its policies based on the feedback it receives from the environment. The ultimate goal is to learn a policy that maximizes the expected cumulative reward over time. Reinforcement Learning is commonly employed in areas where a series of dependent and sequential actions are required to optimize the problem (self-driving cars, games, etc.), unlike supervised learning, which only has an input and a prediction.
To illustrate, imagine a game of chess where you are the "agent". At each turn, you can make a move, and the result of each action will grant you a reward. The "environment" would be your opponent, whether a computer or a human, and it responds to your moves by moving its own pieces. For instance, you could gain a reward of +1 point for each move you make, while an action that results in your opponent capturing one of your pieces would cost you 5 points.
Additionally, you could receive a reward of +5 points for capturing one of your opponent's pieces and +10 points for achieving a check. The design of the reward function can be as complicated as you'd want, and different reward functions may encourage agents to take different strategies under the same environment. Now, there are different approaches to optimizing an agent's policy, and our interest in this article lies in the Actor-Critic method.
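The chess reward scheme described above can be written as a small lookup table. The point values come from the example; the event names and the function `turn_reward` are hypothetical and exist only for this sketch.

```python
# Reward for each kind of event, using the values from the chess example.
REWARDS = {
    "move": 1,             # any move made
    "piece_lost": -5,      # opponent captures one of your pieces
    "piece_captured": 5,   # you capture an opponent's piece
    "check": 10,           # you put the opponent in check
}

def turn_reward(events):
    """Sum the rewards of every event that occurred during one turn."""
    return sum(REWARDS[event] for event in events)

capture_with_check = turn_reward(["move", "piece_captured", "check"])
blunder = turn_reward(["move", "piece_lost"])
```

Changing the numbers in the table changes what the agent is encouraged to do, which is the sense in which different reward functions lead to different strategies.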
True to its name, the Actor-Critic method consists of two function approximators, or simply neural networks, with one taking on the role of the "actor" and the other acting as the "critic" of the actor. The actor network approximates the policy: it receives the agent's current state as input and outputs an action. The critic network approximates the action-value function, an indicator of how "good" the actor's chosen action is.
The action-value function maps a state-action pair to the expected cumulative reward that the agent can obtain by taking that action in that state and then following a given policy. The action-value function discounts future rewards by multiplying them by a factor known as the discount rate. This factor determines how much weight is given to future rewards compared to immediate rewards.
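As a small illustration of discounting, the sketch below computes the discounted sum of one observed sequence of rewards. The action-value function is the expected value of this quantity over many such sequences; the function name `discounted_return` and the reward values are assumptions for this example.

```python
def discounted_return(rewards, discount=0.9):
    """Compute r_0 + d * r_1 + d^2 * r_2 + ... for one reward sequence."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total

rewards = [1, -5, 5, 10]
short_sighted = discounted_return(rewards, discount=0.0)  # only the first reward counts
balanced = discounted_return(rewards, discount=0.5)
far_sighted = discounted_return(rewards, discount=1.0)    # plain sum of all rewards
```

A discount rate near 0 makes the agent care almost only about the immediate reward, while a rate near 1 weighs distant rewards nearly as much as immediate ones.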
The actor network seeks to maximize the action-value function that the critic network represents. The action in the state-action pair fed to the critic network is generated by the policy, that is, by the actor network. The actor network is then updated by backpropagating through the critic network, based on the value the critic produces. The critic network, on the other hand, aims to predict the cumulative reward more accurately, based on the current reward received and an estimate of future rewards. There is much more to Actor-Critic than what is presented here, but these concepts are enough to inspire an adaptation to supervised learning.
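One common way to write these two objectives, for an actor that outputs its action directly, is given below. The notation is an assumption introduced here: $\pi_\theta$ is the actor, $Q_\phi$ the critic, $\gamma$ the discount rate, and $(s_t, a_t, r_t, s_{t+1})$ one observed transition.

```latex
% Critic: match its estimate to the current reward plus the discounted
% estimate of future rewards
\min_{\phi} \; \Big( Q_\phi(s_t, a_t) - \big[\, r_t + \gamma \, Q_\phi\big(s_{t+1}, \pi_\theta(s_{t+1})\big) \big] \Big)^2

% Actor: choose actions that the critic scores highly
\max_{\theta} \; Q_\phi\big(s_t, \pi_\theta(s_t)\big)
```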
Supervised learning simplifies matters considerably. Because there is no need to account for future rewards or for past mistakes the model may have made, Actor-Critic adapts readily to this setting. Instead of estimating the cumulative future return, we can let the critic network predict the current reward, that is, the value of our loss function. The objectives of the two networks then follow directly: the critic learns to predict the loss value for a given input and prediction, and the actor learns to produce predictions that the critic scores well.
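In symbols, one plausible way to state these objectives is the following. The notation is an assumption made for this sketch, not taken from the original: $f_\theta$ is the actor, $C_\phi$ the critic, $\mathcal{L}$ the (possibly non-differentiable) loss, and $(x, y)$ a labeled example.

```latex
% Critic: regress onto the observed loss value (squared error is one choice)
\min_{\phi} \; \Big( C_\phi\big(x, f_\theta(x)\big) - \mathcal{L}\big(y, f_\theta(x)\big) \Big)^2

% Actor: minimize the loss the critic predicts, with gradients flowing
% through C_\phi instead of through \mathcal{L}
\min_{\theta} \; C_\phi\big(x, f_\theta(x)\big)
```

If the critic is instead trained to predict a reward such as $-\mathcal{L}$, the actor maximizes the critic's output rather than minimizing it.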
In practice, the actor and critic networks can be updated alternately. Specifically, the "Actor-Critic for supervised learning" approach can be outlined as follows.
1. Initialize the actor and critic networks.
2. Input the features to the actor network, and record its predictions.
3. Input the predictions and features to the critic network, and record its output, the critic's estimate of how good the predictions are.
4. Compare the output of the critic network with the actual value obtained from the loss function, and train the critic network using any regression-based loss.
5. Train the actor network by backpropagating through the critic network, using the critic's output as the actor's loss value: minimize it if the critic predicts the loss, or, equivalently, maximize it if the critic predicts a "goodness" score such as the negative loss.
6. Repeat steps 2 to 5 for the desired number of iterations/epochs.
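The steps above can be sketched end to end on a toy problem. Everything specific here is an assumption chosen to keep the example short and dependency-light: the actor is a single weight, the critic is a linear model over quadratic features of (input, prediction) fitted by least squares, the black-box loss is a squared error rounded to steps of 0.1, and the hidden rule is y = 2x. The sketch also adds exploration noise to the predictions sent to the black box, which is not part of the outline; without it, the critic never sees how the loss changes around the actor's current prediction.

```python
import numpy as np

def black_box_loss(y_true, y_pred):
    """Stand-in for a non-differentiable objective: squared error in steps of 0.1."""
    return np.round((y_true - y_pred) ** 2, 1)

def critic_features(x, y_pred):
    """Quadratic features of (input, prediction) for a linear critic."""
    return np.stack(
        [np.ones_like(x), x, y_pred, x * x, x * y_pred, y_pred * y_pred], axis=1
    )

def train_actor_critic(steps=300, batch_size=256, actor_lr=0.1,
                       noise_std=0.3, seed=0):
    rng = np.random.default_rng(seed)
    w = 0.0              # step 1: the actor predicts y_pred = w * x
    theta = np.zeros(6)  # step 1: critic weights
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0, size=batch_size)
        y_true = 2.0 * x  # hidden ground-truth rule
        # Step 2: actor predictions, plus exploration noise (added assumption).
        y_explore = w * x + rng.normal(0.0, noise_std, size=batch_size)
        # Steps 3-4: fit the critic to the loss values the black box reports.
        observed = black_box_loss(y_true, y_explore)
        theta, *_ = np.linalg.lstsq(
            critic_features(x, y_explore), observed, rcond=None
        )
        # Step 5: differentiate the critic's predicted loss w.r.t. the
        # prediction, then apply the chain rule (d y_pred / d w = x).
        y_pred = w * x
        dcritic_dy = theta[2] + theta[4] * x + 2.0 * theta[5] * y_pred
        w -= actor_lr * np.mean(dcritic_dy * x)
    return w, theta

w, theta = train_actor_critic(seed=0)
```

The actor never differentiates `black_box_loss`; it only ever sees the critic's smooth approximation of it. In a Deep Learning framework, both models would be neural networks and step 5 would be handled by automatic differentiation through the critic.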
Remember that this approach is not limited to outputting "discrete actions" like a typical Actor-Critic algorithm. In fact, the actor network can be any predictor, as long as the critic network can accept the output format of the actor network. We can also easily adapt the above outline to batched training by storing the states, the actions predicted by the actor network, the values predicted by the critic network, and the actual loss values in a "memory," and updating the networks only once the number of samples seen is greater than or equal to the batch size.
Implications and Potentiality
Adopting concepts from Reinforcement Learning for complex supervised learning problems presents an exciting array of possibilities. Although Actor-Critic is not necessarily the best-performing algorithm in Reinforcement Learning, the adaptations we made to apply it to supervised learning can also be used to modify other RL algorithms, such as Proximal Policy Optimization, which is often a stronger choice than a basic Actor-Critic. And although it has been a relatively unpopular research area in recent times, the idea of training networks without needing a differentiable loss function has been explored, whether directly or indirectly.
The paper "Learning to Learn: Meta-Critic Networks for Sample Efficient Learning" by Flood Sung et al. proposes a framework based on Actor-Critic methods for meta-learning in supervised and semi-supervised learning environments. Furthermore, the exciting forward-forward algorithm recently explored by Geoffrey Hinton completely throws the notion of "loss functions," as we know them, out of the window. The algorithm trains each layer of the network individually. By more closely resembling the activities of biological neurons rather than mathematically optimizing a loss function, forward-forward can pipeline sequential data through a neural network without storing neural activities or propagating error derivatives.
As modern ML problems grow in complexity, the demand for flexible and adaptable algorithms beyond backpropagation will increase. Hinton argues in the paper on the forward-forward algorithm that backpropagation lacks biological plausibility, noting that "backpropagation remains implausible despite considerable effort to invent ways in which it could be implemented by real neurons." Future optimization frameworks for neural networks may therefore move towards methods similar to Reinforcement Learning, where the learning environment and strategy resemble the way we ourselves act, or towards methods that imitate how biological neurons propagate learning through the brain, like the forward-forward algorithm.