Liquid Neural Networks: Fluid, Flexible Neurons



By Brad Nikkel

AI Content Fellow

Last Updated

Jul 3, 2024

Artificial neural networks (ANNs) are so flexible that, given sufficient depth and breadth, there exists an ANN that can theoretically approximate any function (at least well enough to be useful for most applications).

But many ANNs are also rigid; we feed them a bunch of data; they learn complex patterns from this data; and we then apply their pattern recognition prowess toward some task. A nagging limitation with their design is that most ANNs’ knowledge is frozen within their weights after training.

This means that if (or more realistically, when) a traditional ANN encounters something significantly different from the data it trained on (i.e., out-of-distribution data), it’s apt to stumble beyond acceptable performance. When that happens, we typically need to hack together some tricks or fine-tune the ANN on an updated dataset that includes examples of whatever data tripped it up.

But retraining ANNs whenever we discover they have significant knowledge gaps is unacceptable for safety-critical applications (e.g., autonomous vehicles, medical devices, or aerospace applications). And even for applications as benign as recommendation systems, for example, routine retraining is a cumbersome, costly method of updating ANNs’ knowledge (in time and compute). We need ANNs that are fluid enough to learn on the fly, continually tweaking their weights in response to new inputs (like our biological brains do).

On several occasions, Meta’s Chief AI Scientist Yann LeCun has remarked that before we get human-level artificial intelligence, we ought to first create simpler dog or cat-level artificial intelligence (which he believes our current State-of-the-Art (SoA) models remain far from).

Over the past few years, a team of MIT researchers took LeCun’s first-aim-for-simpler-than-human-level-AI approach to a whole 'nother level by turning to worms for inspiration.

MIT researchers Ramin Hasani, Mathias Lechner, Alexander Amini, and Daniela Rus developed several variants of more adaptive ANNs dubbed "Liquid Neural Networks" (LNNs), roughly modeled after a 2 mm-long nematode’s nervous system. These LNNs can handle time series data with far fewer neurons than the recurrent neural networks (RNNs) typically used for temporal data, making them cheaper (for training and inference) and more interpretable. Several tests suggest LNNs are more expressive, stable, and causal (grasping the “why” behind their predictions). Though not without weaknesses, LNNs clearly enjoy strengths across several areas. We’ll learn how LNNs work and how Hasani et al. tested them against other models in a bit, but first we should address how exactly a squirmy species spurred a new approach to ANNs.

Why Worms?

Squinting at the lowly nematode’s nervous system with the hopes of emulating intelligence might seem like an odd bet, but doing so granted Hasani et al. an undeniable advantage—simplicity.

Image Source: Bob Goldstein (Crawling C. elegans hermaphrodite worm)

The Caenorhabditis elegans (C. elegans) worm has 302 neurons. The “wired” connections among them have long been mapped into a connectome (since 1986), along with, more recently, many of its “wireless” neuropeptide interactions. It’s far more tractable to model a worm’s few hundred neurons than it would be to do the same for a fruit fly’s nearly 200 thousand neurons or a human’s amalgamation of approximately 86 billion neurons (a tangled mess that neuroscientists are still making sense of).

It’s not just the broad human-worm neuronal quantitative divide, though. A single C. elegans neuron is smaller than a single human neuron, has a simpler structure, is subject to fewer neurotransmitters' influence, and connects to a smaller number of nearby neurons. All these attributes make a single human neuron more expressive than a single C. elegans neuron.

Since it’s tough to find a simpler, better-understood nervous system to model than C. elegans’, Hasani et al. investigated how this worm’s neurons and synapses communicate with one another, abstracting these processes as nodes of connected differential equations that communicate with one another through input-dependent linear functions. But that’s a mouthful, so let’s look at this at a high level before we go under the hood.

What Exactly is “Liquid” about Liquid Time-Constant Networks?

Imagine you decide to curate an artificial intelligence (AI) newsletter like AI Minds. Given the current pace of AI research, to separate the noteworthy from the bland, you’d need to scan oceans of headlines, recollecting what’s been done before and forgetting redundant studies, lest you suffer from information overload.

This would require shifting your focus to different subfields of AI, partly based on the “rate of change” of their outputs. For example, maybe one month, multiple researchers crank out hundreds of novel reinforcement learning papers, and then that tapers off and the release of several multimodal models over the span of a few weeks grabs all the buzz, and after that, a company drip releases one new domain-specialized speech model per day for five days. While complex enough to give even a meticulously organized editor a migraine, this scenario enjoys a simplifying aspect; both its inputs and time steps are discrete; you can only discover one article at a time (even if at a rapid pace).

Now let’s consider a similar scenario, except continuous. When we drive a vehicle, we calibrate the rate at which we ingest novel information and let go of the old, mostly on autopilot. Our forests of neurons respond strongly to novel stimuli, like a car cutting in front of us or an upcoming turn in the road, while dialing down routine stimuli, like the horizon of a straight, empty highway we’re cruising down. In a sense, this is more complex than skimming AI articles because no clear input or time boundaries exist, yet our neurons manage, constantly adjusting their response in proportion to how usual or unusual inputs are. Artificial neurons ought to constantly adapt too.

Building on research in Neural Ordinary Differential Equations (N-ODEs) and Continuous-Time RNNs (CT-RNNs), Hasani et al. designed their first LNN variant, Liquid Time-Constant Networks (LTCs), to mimic neurons’ continuous dynamic calibration processes at a sub-neuron scale.

To do this, LTCs borrow from Louis Lapicque’s 1907 “leaky-integrator” neuron model:

The two main parts of this model are as follows:

Hasani et al. wanted S(t) to model non-linear synapse activity (synapses are tiny gaps between neighboring neurons that facilitate signal passing between them), which is a key difference between many ANNs and LTCs.
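For reference, the leaky-integrator dynamics are commonly written in the form below, which is also how Hasani et al.'s LTC paper states them. The symbol \tau (the neuron's time constant) is that paper's notation and is an assumption here, since this article does not define it.

```latex
\frac{dx(t)}{dt} = -\frac{x(t)}{\tau} + S(t)
```

Read this way, the two parts are the leak term -x(t)/\tau, which pulls the neuron's state x(t) back toward rest at a pace set by \tau, and the input term S(t), which pushes the state around in response to incoming signals.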

Image Source: Hasani et al. (Neural and synapse dynamics)

Spiking and Graded Potential Neurons

Most current ANNs model spiking neurons as discrete processes: either a signal fires (passing a scalar weight representing an electrical charge to the next neuron) or it doesn’t. In contrast, LTCs model synapse activity as a continuous process (a gradual, varying flow of neurotransmitters and ions affecting a neuron’s and its neighboring neurons’ action potential, which influences how signals propagate between neighboring neurons).

Both humans and C. elegans have spiking and non-spiking neurons, but C. elegans have more non-spiking (graded potential) neurons than spiking. Since Hasani et al. wanted to model C. elegans’ nervous system, they used Hodgkin and Huxley’s 1952 conductance-based model of synapse activity:

Let’s look at what each part of this represents. The list below explains each term.

S(t) is the total synaptic input to a neuron at time t.
f( x(t), I(t), t, theta ) represents how synapses transmit signals between neurons. It takes in:
    x(t), the current state (or membrane potential) of a postsynaptic neuron (the neuron that would receive the signal from the current neuron) at time t.
    I(t), a neuron’s external input (the stimuli) at time t.
    t, the current time step.
    theta, the parameters of the function f, which can be learned during training. f can be a linear or nonlinear function (e.g., a sigmoid or a rectified linear unit) that modulates synaptic transmissions based on a neuron’s current state, x(t), and its external input, I(t).
( A - x(t) ) represents the driving force of the synaptic input.

A is a constant representing a synapse reversal potential (when the chemical signal flow direction flips). The further the current membrane potential x(t) is from A, the stronger the synaptic input affects the neuron (by modeling increased ion flow). And, vice versa, as x(t) approaches A, the synaptic input diminishes (modeling decreased ion flow).
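Putting the terms above together gives a compact picture of where the "liquid time constant" comes from. The sketch below assumes the leaky-integrator form dx/dt = -x/\tau + S(t) from the LTC paper; the symbols \tau and \tau_{sys} follow that paper's notation and are not defined in this article.

```latex
\begin{align}
S(t) &= f\big(x(t), I(t), t, \theta\big)\,\big(A - x(t)\big) \\
\frac{dx(t)}{dt} &= -\frac{x(t)}{\tau} + f\big(x(t), I(t), t, \theta\big)\,\big(A - x(t)\big) \\
 &= -\left[\frac{1}{\tau} + f\big(x(t), I(t), t, \theta\big)\right] x(t) + f\big(x(t), I(t), t, \theta\big)\,A \\
\tau_{sys} &= \frac{\tau}{1 + \tau\, f\big(x(t), I(t), t, \theta\big)}
\end{align}
```

Because f depends on the neuron's state and its input, the effective time constant \tau_{sys} is not fixed: a strong synaptic drive makes f larger, which shrinks \tau_{sys} and makes the neuron respond faster.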

Image Source: Hasani et al., 2021 (Notice how the structure of arbitrary connected neurons is mostly the same for standard ANNs and LNNs. The main difference is that standard ANNs’ weights (W) and biases (B) are modulated by scalar values, whereas the LNNs’ weights and biases are modulated by differential equations.)

Traditional ANNs tend to update their knowledge at fixed intervals (one neuron layer at a time). LTCs’ individual neurons, in response to their own current state and incoming data, adaptively adjust the speed at which they update their knowledge (growing more receptive or more indifferent to incoming stimuli with time). This is similar to how neurons in your brain might fire rapidly when you first read “Attention is All You Need” and flicker sporadically after skimming your 100th transformer-related paper. It’s LTCs’ constant, fluid adjustments that led Hasani et al. to describe their neural networks as “liquid.”

How Do Liquid Time-Constant Neural Networks Learn?

To mimic this "liquid" behavior, each artificial neuron in an LTC uses the Ordinary Differential Equation (ODE) that we discussed above to model its varying responsivity. Solving ODEs can be computationally expensive, though, so Hasani et al. employed a fixed-step numerical ODE solver that balances the stability of slower, implicit solvers with the speed of faster, explicit ones.
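As a rough illustration of such a solver, here is a minimal single-neuron sketch of a fused (semi-implicit) Euler step of the kind described in the LTC paper. The gate f is assumed to be a sigmoid of a weighted sum of input and state, and every parameter name and value (tau, A, w_in, w_rec, bias, dt) is a hypothetical choice for illustration rather than anything taken from the authors' code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ltc_fused_step(x, inp, dt, tau, A, w_in, w_rec, bias):
    """Advance one LTC neuron by dt with a fused explicit/implicit Euler update."""
    # Gate computed from the current state and input (toy choice of f).
    f = sigmoid(w_in * inp + w_rec * x + bias)
    # The state term is treated implicitly (it ends up in the denominator),
    # which keeps the update stable even for fairly large dt.
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))


def simulate(inputs, x0=0.0, dt=0.1, tau=1.0, A=1.0, w_in=2.0, w_rec=0.5, bias=-1.0):
    x, trace = x0, []
    for inp in inputs:
        x = ltc_fused_step(x, inp, dt, tau, A, w_in, w_rec, bias)
        trace.append(x)
    return trace


# Quiet input, then a strong stimulus, then quiet again.
inputs = [0.0] * 20 + [3.0] * 20 + [0.0] * 20
trace = simulate(inputs)
print(f"end of quiet phase    : {trace[19]:.3f}")
print(f"end of stimulus phase : {trace[39]:.3f}")
print(f"end of recovery phase : {trace[59]:.3f}")
```

The state rises while the stimulus is on and relaxes once it stops, and each update costs only a handful of arithmetic operations per neuron.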

Like most ANNs, LTCs need to train on data to “learn” to produce useful outputs. To do this, Hasani et al. chose Backpropagation Through Time (BPTT), which unrolls the network across time steps and propagates the error gradient from the outputs back through every step. Because each LTC neuron is governed by an ODE, BPTT also computes gradients for the ODE’s parameters, adjusting not only connection weights but also each neuron’s time constants and dynamic properties. This enables the network to fine-tune how each neuron's state evolves over time, capturing complex temporal patterns.
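To make the idea concrete, here is a toy sketch of BPTT through a single LTC-style neuron, with the gradients written out by hand. It assumes a fused Euler update and, to keep the derivatives short, a gate that depends only on the input; the parameter values, learning rate, input signal, and target are all made up for illustration and are not from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, inputs, dt=0.1, x0=0.0):
    """Unroll fused Euler steps for one neuron and cache values for the backward pass."""
    tau, A, w, b = params
    x, cache = x0, []
    for inp in inputs:
        f = sigmoid(w * inp + b)          # input-dependent gate (toy assumption)
        num = x + dt * f * A
        den = 1.0 + dt * (1.0 / tau + f)
        cache.append((inp, f, num, den))
        x = num / den
    return x, cache


def loss_and_grads(params, inputs, target, dt=0.1):
    """Squared error on the final state, plus gradients for [tau, A, w, b] via BPTT."""
    tau, A, w, b = params
    x_final, cache = forward(params, inputs, dt)
    loss = 0.5 * (x_final - target) ** 2
    grad_x = x_final - target             # dLoss/dx at the last time step
    grads = [0.0, 0.0, 0.0, 0.0]
    for inp, f, num, den in reversed(cache):
        g_num = grad_x / den
        g_den = -grad_x * num / den ** 2
        g_f = g_num * dt * A + g_den * dt
        g_z = g_f * f * (1.0 - f)         # through the sigmoid
        grads[0] += g_den * (-dt / tau ** 2)   # time constant
        grads[1] += g_num * dt * f             # reversal potential A
        grads[2] += g_z * inp                  # gate weight
        grads[3] += g_z                        # gate bias
        grad_x = g_num                    # gradient flowing to the previous state
    return loss, grads


inputs = [math.sin(0.3 * k) for k in range(30)]
target = 0.6
params = [1.0, 1.0, 0.5, 0.0]             # tau, A, w, b
initial_loss, _ = loss_and_grads(params, inputs, target)
for _ in range(300):
    loss, grads = loss_and_grads(params, inputs, target)
    params = [p - 0.1 * g for p, g in zip(params, grads)]
    params[0] = max(params[0], 0.05)      # keep the time constant positive
final_loss, _ = loss_and_grads(params, inputs, target)
print(f"loss before: {initial_loss:.5f}, after: {final_loss:.5f}")
```

Note that the time constant tau receives a gradient just like the gate weight does, which is the sense in which training shapes each neuron's dynamics and not only its connections.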

Benefits of Liquid Time-Constant Neural Networks

Hasani et al. demonstrated several advantages of LTCs, namely that they’re provably stable, they’re expressive, and they predict different time series data well. Let’s look at each of these more closely.

Dynamic Yet Bounded

Ever-learning, ever-updating systems like LTCs have the potential to go haywire without some constraints on their internal knowledge states and on their learning rate; neither should be too high nor too low.

Hasani et al. proved that their LTCs’ hidden states and time constants (we can think of these as LTC-neurons’ independent, varying learning rates) are both bounded within safe ranges, ensuring they’re stable and predictable, even if their inputs become huge.
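A quick empirical check can make this boundedness tangible. The sketch below drives one toy LTC-style neuron with enormous random inputs and records its state and effective time constant. It assumes a fused Euler update, a sigmoid gate with values in [0, 1], and hypothetical parameters (tau = 1, A = 1); the paper's proofs concern the continuous ODE, so this is only an illustration, not a reproduction of them.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function, safe for very large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ltc_step(x, inp, dt=0.1, tau=1.0, A=1.0, w=1.0, b=0.0):
    f = sigmoid(w * inp + b)                       # gate stays within [0, 1]
    x_next = (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))
    tau_sys = tau / (1.0 + tau * f)                # effective time constant
    return x_next, tau_sys


rng = random.Random(0)
x, states, taus = 0.0, [], []
for _ in range(1000):
    x, tau_sys = ltc_step(x, rng.uniform(-1e6, 1e6))   # absurdly large inputs
    states.append(x)
    taus.append(tau_sys)

print(f"state range         : [{min(states):.3f}, {max(states):.3f}]")
print(f"time-constant range : [{min(taus):.3f}, {max(taus):.3f}]")
```

In this toy setup the state never leaves the interval between 0 and A, and the effective time constant stays between tau / (1 + tau) and tau, no matter how large the inputs get.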

Expressive

To compare different models’ expressivity, Hasani et al. used a test that compares the trajectory lengths of inputs fed through artificial neural networks.

ANNs transform their inputs more and more with each progressive layer of neurons. By measuring how much an input changes as it travels through a neural network (i.e., how complex the path that it takes is), we can gauge that network’s expressivity.

If we put a circle, for example, through a series of processes that change its shape, we can measure how complex the new shape is by looking at how much it stretches and twists. The more contorted the path, the more complex the patterns the network can handle. This means we can roughly quantify how complex an ANN’s representation is by measuring the length of this path. You can see this in the image below.

Image Credit: Raghu et al. (Going rightward, you see the evolving trajectory of a circle fed through further hidden layers of an ANN; the trajectory lengthens as it passes through subsequent hidden layers.)
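The trajectory-length idea can be sketched in a few lines. The example below pushes points on a circle through a randomly initialized tanh network and measures the path length after each layer. The width, depth, weight scale, and seed are arbitrary illustrative choices, and this is a plain feedforward network, not an LTC or the exact protocol used in either paper.

```python
import math
import random


def circle(n_points=200):
    """Points along a unit circle: the input trajectory."""
    return [[math.cos(2 * math.pi * k / n_points), math.sin(2 * math.pi * k / n_points)]
            for k in range(n_points)]


def random_layer(n_in, n_out, sigma_w, rng):
    """Gaussian weights with standard deviation sigma_w / sqrt(n_in)."""
    scale = sigma_w / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, points):
    return [[math.tanh(sum(w * v for w, v in zip(row, p))) for row in weights]
            for p in points]


def trajectory_length(points):
    """Sum of distances between consecutive points along the trajectory."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


rng = random.Random(0)
points = circle()
lengths = [trajectory_length(points)]
for _ in range(4):                                  # four hidden layers
    layer = random_layer(len(points[0]), 32, 4.0, rng)
    points = apply_layer(layer, points)
    lengths.append(trajectory_length(points))

for depth, length in enumerate(lengths):
    print(f"after layer {depth}: trajectory length = {length:.2f}")
```

Layer 0 here is the raw circle, so its length is close to the circumference 2*pi; comparing later entries against it shows how much the network has stretched and twisted the input path.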

Using trajectory length tests, here’s what Hasani et al. found out about LTCs’ expressivity:

Complexity: LTCs generally create more complex trajectories than other types of networks, indicating that they better handle complex tasks.
Growth of Complexity: This complexity increased as the network widened and as the network’s distribution of weights changed.
Effect of Activation Functions: Different activation functions (like ReLU or Hardtanh) can affect the complexity of the trajectories.
Depth Impact: Unlike some ANNs, increasing LTCs’ depth (number of layers) doesn’t always lead to more complex trajectories.

The charts below demonstrate these findings:

Image Credit: Hasani et al. (Trajectory length deformation as a function of network width (ReLU): The LTC’s trajectory length was significantly longer (meaning more expressive) than the N-ODE’s and CT-RNN’s. Also, the expressivity jumped as the width increased.)

Image Credit: Hasani et al. (Trajectory length deformation as a function of network width using Hardtanh: The Hardtanh LTC’s trajectory was significantly longer (meaning more expressive) than the N-ODE’s or CT-RNN’s and longer than the ReLU LTC’s above in figure C.)

Image Credit: Hasani et al. (Trajectory length per network layer using Tanh: Note how the LTC remained more expressive than the N-ODE and CT-RNN across all layers.)

Hasani et al. showed quantitatively that LTCs can capture more complex patterns than N-ODEs and CT-RNNs by comparing the trajectory lengths of different models. This suggests that LTCs can process complex, constantly changing data well.

How Liquid Time-Constant Neural Networks Handle Time Series

To see how LTCs handle real-world data, Hasani et al. conducted time-series prediction experiments in a few domains. They pitted their LTCs against SoA discretized RNNs, LSTMs, CT-RNNs (ODE-RNNs), Continuous-Time Gated Recurrent Units (CT-GRUs), and Neural ODEs. LTCs had a 5% to 70% performance edge over other RNN models in four out of seven experiments and similar performance in the rest.

Let's look at a couple of these experiments in more detail:

Human Activity Dataset

Hasani et al. put LTCs to the test using the Human Activity dataset, which contains thousands of sequences of human activities like walking, sitting, and lying down, recorded from smartphone sensors. The goal was to see how well different models could classify these activities based on the sensor data.

They ran two different versions of this test, each with its own setup and range of models for comparison. In both cases, LTCs showed impressive performance:

In the first test, LTCs achieved an accuracy of 85.48%, slightly edging out other advanced models like CT-GRUs and LSTMs.
In the second, more challenging version of the test, LTCs really shone, reaching an accuracy of 88.2% and outperforming a wider range of specialized models.

Below are the results:

Image Source: Hasani et al. (Human Activity Test Setting 1)

Image Source: Hasani et al. (Human Activity Test Setting 2)

Half-Cheetah Kinematic Modeling

To test how well these models capture physical dynamics, Hasani et al. used the HalfCheetah-v2 gym environment, powered by the MuJoCo physics engine. The task? Predict the motion of a two-dimensional cheetah-like robot based on its joint angles and control outputs.

Image Source: Hasani et al. (Half cheetah physics simulation)

To crank up the challenge, they randomly overwrote 5% of the actions, tossing some chaos into the system. Despite this curveball, LTCs still outshone the competition. They achieved a mean squared error of 2.308, significantly lower than the next best performer, LSTM, which scored 2.500.

Image Source: Hasani et al. (Half cheetah test results)

These results suggest that LTCs aren't just theoretical; they adeptly handle real-world time-series data. Whether it's predicting human activities or the complex dynamics of a robotic cheetah, LTCs seem to have a knack for capturing the underlying patterns in temporal data.

Improving Liquid Time-Constant Neural Networks

Since their LTC paper, Hasani et al. and other researchers have built two key improvements on LTCs: Neural Circuit Policies (NCPs) and Closed-form Continuous-Time neural networks (CfCs). Let’s take a look at each.

Neural Circuit Policies: Wormesque Wiring

After initial tests showed that individual LTC units modeled temporal data well, Hasani et al. wanted to push LTCs further, adapting them to real-world control tasks like autonomous driving. To do this, they strung together LTCs into biologically-inspired network architectures called Neural Circuit Policies (NCPs).

NCPs are inspired by a distinct four-layer hierarchical neural network circuit commonly found in C. elegans’ nervous system. This circuit includes:

sensory neurons that take in sensory input
inter-neurons and command neurons that generate an output decision
motor neurons that engage the worm’s muscles

Feedforward connections tend to link sensory neurons with inter-neurons and command neurons with motor neurons, and recurrent connections tend to link inter-neurons and command neurons. Using these components and their relationships, Lechner et al. devised the following 4-step process for crafting an NCP:

1. Create four layers of neurons (sensory, inter-neurons, command, and motor neurons).
2. Connect these layers according to the following rules:
    a. For each neuron in a layer, create some connections (synapses) to neurons in the next layer.
    b. The number and targets of these neuron connections are chosen randomly, based on a binomial distribution.
    c. Each connection is randomly set to be either excitatory or inhibitory based on a Bernoulli probability distribution.
3. Ensure all neurons are connected.
    a. If a neuron has no incoming connections, give it some.
    b. Again, the number and sources of these connections are chosen randomly (using the same probability distributions from step 2).
4. Add recurrent connections.
    a. In the command neuron layer, add connections between command neurons.
    b. The number, targets, and nature (excitatory/inhibitory) of these connections are also randomly determined.

This random, probabilistic approach to building NCPs mimics the variability found in C. elegans’ wetware, creating an overall organized model while allowing small, random variations.
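The four steps above can be sketched as a small wiring generator. This is a simplified illustration, not Lechner et al.'s implementation: the function name, the connection probability p_connect, the excitatory probability p_excitatory, and the layer sizes are all hypothetical, and each potential synapse is sampled independently so that the number of connections per neuron follows a binomial distribution.

```python
import random

ORDER = ["sensory", "inter", "command", "motor"]


def build_ncp_wiring(sizes, p_connect=0.3, p_excitatory=0.7, seed=0):
    """Return (layers, synapses); synapses maps (src, dst) to +1 or -1."""
    rng = random.Random(seed)
    # Step 1: create the four layers of neurons.
    layers = {name: [f"{name}_{i}" for i in range(sizes[name])] for name in ORDER}
    synapses = {}

    def polarity():
        # Bernoulli draw: +1 is excitatory, -1 is inhibitory.
        return 1 if rng.random() < p_excitatory else -1

    def binomial_pick(candidates):
        # Independent coin flips, so the count of picks is binomial.
        return [c for c in candidates if rng.random() < p_connect]

    for src_name, dst_name in zip(ORDER, ORDER[1:]):
        # Step 2: random feedforward synapses into the next layer.
        for src in layers[src_name]:
            for dst in binomial_pick(layers[dst_name]):
                synapses[(src, dst)] = polarity()
        # Step 3: give every neuron without incoming synapses at least one.
        for dst in layers[dst_name]:
            if not any(d == dst for _, d in synapses):
                sources = binomial_pick(layers[src_name]) or [rng.choice(layers[src_name])]
                for src in sources:
                    synapses[(src, dst)] = polarity()

    # Step 4: recurrent synapses among command neurons (self-loops allowed here).
    for src in layers["command"]:
        for dst in binomial_pick(layers["command"]):
            synapses[(src, dst)] = polarity()
    return layers, synapses


layers, synapses = build_ncp_wiring({"sensory": 4, "inter": 5, "command": 3, "motor": 2}, seed=1)
excitatory = sum(1 for sign in synapses.values() if sign == 1)
print(f"{len(synapses)} synapses, {excitatory} excitatory, {len(synapses) - excitatory} inhibitory")
```

Changing the seed yields a different but similarly structured circuit, which mirrors the organized-yet-variable wiring described above.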

Lechner et al. made their "liquid" approach to neural networks more useful by modeling not only how this nematode's nerve cells interact with each other (LTCs) but also how they connect to form neural circuits (NCPs).

What Can Neural Circuit Policies Do?

Lechner et al. found that a surprisingly compact NCP—with just 19 control neurons connected by 253 synapses—could effectively map high-dimensional inputs (camera feeds) to steering commands for autonomous vehicles. This tiny network not only outperformed much larger black-box systems in generalizability and robustness but also performed well under the same noisy conditions that confused CNNs and LSTMs (where human drivers often had to intervene to avoid a crash).

Image Source: Lechner et al. (As input noise was introduced during real driving tests, NCPs “crashed”—meaning required human intervention—the least.)

Perhaps most importantly, thanks to NCPs’ small size and every NCP neuron’s behavior being described by a differential equation, NCPs are far more interpretable than traditional deep learning models. Lechner et al. used principal component analysis to look into how each neuron worked during the driving tests. They found that NCP's first principal component handled the elementary parts of driving (going straight, turning left, and turning right), while its second principal component handled more specific tasks. LSTMs and CNNs, on the other hand, required 2 to 3 principal components just for the elementary parts of driving.

Image Source: Lechner et al. (Principal components projected onto straight, right, and left turns of a driving course: k = LSTM, l = CT-RNN, m = CNN, n = NCP)

Lechner et al. also found that specific neurons that were consistently active when driving straight maintained a mostly constant reaction speed, whereas neurons that were consistently active during turns increased their reaction speed. This suggests that specific neurons learn to represent specific phenomena and adjust their time constant according to that task. It also means we can feasibly trace back decisions to specific neurons and connections within NCPs and adjust them for specific tasks.

NCPs’ success at autonomous driving tasks shows that tackling complex, real-world problems doesn’t necessarily require giant neural networks. Lechner et al. created a powerful, lean, and transparent model by threading together LTC neurons into a neural-circuit hierarchy similar to a worm’s. While NCPs show us that the principles of LTCs can be scaled up to improve their usefulness, NCPs still use LTCs, which solve differential equations, which can be slow.

Closed-form Continuous-Time Neural Networks: Faster Liquid Neural Networks

Recall that Hasani et al.'s LTCs used a hybrid numerical differential equation solver (a mix of explicit and implicit solvers). Numerical solvers iterate across many tiny steps to find an approximate solution to differential equations, making them computationally expensive. This means LTCs that use them can be too slow for some applications.

A faster option would be closed-form differential equation solutions, which are symbolic, mathematical formulas often derived by hand. If we knew all the closed-form differential equation solutions needed for LTCs, we could speed them up. But, for many differential equations, no known closed-form solution exists.

To overcome this limitation, Hasani et al. approximated closed-form solutions (to LTCs’ differential equations) that they proved stay within tight upper and lower bounds of error. This closed-form solution maintains accuracy close to that of LTCs while giving CfCs a one-to-five-orders-of-magnitude increase in training and inference speed over LTCs that use numerical solvers.

Image Source: Hasani et al., 2022 (CfCs still use LTCs’ model of synapse-neuron interactions, but instead of using a slow, costly numerical solver, they employ a fast, approximate closed-form solver)
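To see why a closed form is so much cheaper than a numerical solver, consider the one special case where the LTC-style equation has an exact textbook solution: a single neuron whose gate f is held constant. The sketch below compares stepwise Euler integration with that exact formula. It is not the CfC approximation itself, which handles input-dependent gates, and the parameter values are hypothetical.

```python
import math


def euler_solve(x0, f, A, tau, t_end, n_steps):
    """Numerically integrate dx/dt = -(1/tau + f) * x + f * A with explicit Euler."""
    dt = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x += dt * (-(1.0 / tau + f) * x + f * A)
    return x


def closed_form(x0, f, A, tau, t):
    """Exact solution of the same linear ODE when f is constant."""
    rate = 1.0 / tau + f
    x_inf = f * A / rate                  # steady state the neuron relaxes toward
    return x_inf + (x0 - x_inf) * math.exp(-rate * t)


x0, f, A, tau, t_end = 0.0, 0.8, 1.0, 1.0, 2.0
exact = closed_form(x0, f, A, tau, t_end)
errors = {}
for n_steps in (10, 100, 10000):
    errors[n_steps] = abs(euler_solve(x0, f, A, tau, t_end, n_steps) - exact)
    print(f"{n_steps:>6} Euler steps: error vs closed form = {errors[n_steps]:.2e}")
```

The solver needs thousands of small steps to approach what the formula returns in a single evaluation, which is the kind of gap that CfCs exploit.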

Hasani et al. tested a handful of their CfC variants against a wide range of baseline models. These included classical RNNs (LSTMs, GRUs, RNN-Decay, etc.), ODE-based models (ODE-RNN, Latent-ODE), continuous-time models (CT-RNN, CT-LSTM, etc.), and advanced RNN architectures designed for long-term dependencies (Legendre memory units, HiPPO, etc.) and irregular data (GRU-D, PhasedLSTM, etc.). In most tests, the CfCs achieved more accurate results at significantly faster training speeds. Here’s an overview of the tests and their results:

Human Activity Recognition

Task: Classify smartphone sensor data to its corresponding human activity (e.g., sitting down, walking, driving, etc.)
Result: CfCs outperformed other models’ accuracy by a few percentage points with much faster processing times (an 8,752% speedup over the best ODE-based model).

Physical Dynamics Modeling (with the Walker2D dataset)

Task: Using the MuJoCo physics engine, model a simulated two-legged robot’s future joint positions as it “walks.”
Result: CfCs beat other baselines, including transformers, with significant speedups.

Event-based Sequential Image Processing

Task: Classify MNIST hand-written digits that were flattened to a single vector and transformed into irregular sequences.
Result: CfC variants outperformed other models in accuracy (~98%) and especially in speed (200 to 400% faster).

Bit-stream XOR

Task: Classify bit streams using an XOR function over time.
Result: CfCs performed well on regularly and irregularly sampled bitstreams, whereas other models struggled with irregularly sampled data due to their vanishing/exploding gradient problem.

PhysioNet Challenge

Task: Predict patient mortality based on a time series of medical measurements.
Result: CfCs performed competitively while being 160 to 220 times faster than other continuous models and even three times faster than some discretized models.

IMDB Sentiment Analysis

Task: Starting from scratch (randomized word embeddings), learn to classify sentiment in movie reviews (positive or negative).
Result: CfCs with mixed memory instances outperformed other RNNs.

Autonomous Driving

Task: Map pixel observations to steering commands, testing how well a model can keep an autonomous vehicle in its lane.

Result: With only around 4k trainable parameters, CfCs demonstrated consistent attention patterns, even under noisy conditions, performing on par with LTC-based networks (though all models were generally able to keep the tested vehicle on the road and in its lane at approximately 30 km/h). Check the attention maps below to see where the tested models tended to attend.

Image Source: Hasani et al. (Test in Summer: Note how CfCs and NCPs tend to focus ahead near the horizon (similar to human drivers), whereas the CNN and LSTM focus directly in front of the vehicle.)

Image Source: Hasani et al. (Test in Winter: Note how CfCs, NCP, and LSTM attend to the horizon, whereas the snow seems to confound the CNN.)

Image Source: Hasani et al. (Test Under Noise: Hasani et al. corrupted this test's inputs with Gaussian noise. Note how the CfCs and NCP handled the noise better than the CNN and LSTM.)

Overall, CfCs tended to perform in the ballpark of LTCs accuracy-wise and considerably better and significantly faster than other ODE-based neural networks.

A Company Dedicated to Liquid Neural Networks

Hasani et al. believe in LNNs so much that they founded a company, "Liquid AI," which builds custom LNN-based solutions for other companies (i.e., liquid neural networks as a service).

In a recent interview with “This Week in Startups,” Hasani told host Jason Calacanis that Liquid AI is working on “liquid” foundational models and has already developed language models that can run on Raspberry Pis. Though he didn’t give specific performance metrics, Hasani claimed Liquid AI has models that are between 10 and 1000 times more efficient at inference energy use and 10 to 20 times more efficient at training than transformer-based models.

Others are pumped about LTCs’ potential too. Stability AI’s founder and former CEO, Emad Mostaque, recently opined that Liquid AI is one of the most exciting generative AI companies today. Time will tell if LNNs live up to their early praise, but if you’re not keen on waiting, fret not: you can start experimenting with LNNs today thanks to Mathias Lechner’s (of Liquid AI) Neural Circuit Policy Python pip package (it contains LTCs, CfCs, and NCPs, and a few Google Colab notebooks to play around with). Even if LNNs turn out to be fruitless (which seems unlikely), it’s clear that the bio-inspired engineering approach that Hasani et al. leaned on to develop them holds plenty of untapped potential for designing novel ANNs.