Augmenting LLMs Beyond Basic Text Completion and Transformation
Large Language Models (LLMs) are like an iceberg: we only see the tip grazing the surface, and the deeper we dive, the bigger the iceberg becomes. The sheer amount of data they’re trained on is invisible to us, and their emergent capabilities become evident as they keep scaling in size. LLMs are, by architectural nature, very dynamic black boxes. Their ability to execute many tasks by interpreting a large quantity of word embeddings allows LLMs to solve many natural language problems better than previous state-of-the-art neural networks. From summarization to translation and even generation, it seems like we frequently learn something new about what LLMs can do.
When looking at LLMs’ ability to perform previously unfathomable tasks, like text-to-code generation, it’s easy to become wide-eyed. It is remarkable to see products like GitHub’s Copilot build off of the text-to-code capabilities of LLMs to let developers effortlessly generate code, comments, optimizations, unit tests, and more. Even though text-to-code generation is still in its early days, it sparks a line of questions about which barriers in this field will be broken next. Could we see a large abstraction of DevOps, like using natural language to automatically write infrastructure code with Terraform? What about using an LLM to create complete full-stack applications with fully generated documentation, including resource requests?
Text-to-code’s recent emergence as a latent ability of LLMs showcases how these models can serve as high-bandwidth human-computer interfaces with the right application. Being able to ask a computer program, in English (or any number of other human languages), to generate code is undoubtedly compelling, but that is just one of over 130 emergent capabilities of large language models. Just as new hardware functionality or operating system features point developers in new directions, these emergent capabilities may be the building blocks of future applications of language models.
As LLMs continue to increase in complexity and robustness, they’ll yield more “second-order” applications. Instead of just outputting text, we’re starting to see transformers that go beyond the realm of core NLP tasks. But they can’t do it without a little seasoning.
Augmenting Large Language Models
In order for LLMs to tackle problems of higher complexity that extend beyond core NLP tasks, they need to become more robust than they currently are. These models still suffer from non-factual but seemingly plausible responses known as “hallucinations.” These can stem from erroneous encoding and/or decoding by the transformer, or from divergences in the vast training data. If you got a chance to read about the Sydney-Bing fiasco, it’s pretty evident why hallucinations are a major obstacle for LLMs to circumvent if they’re to be used regularly. These errors tend to propagate in workflows involving a heavy amount of reasoning, such as arithmetic tasks or chain-of-thought reasoning. When a more complex problem needs to be decomposed into smaller “subproblem” chunks, the chance of hallucination may be higher if the model tries to solve it head-on. Strategies like training on few-shot examples and directly providing step-by-step reasoning for the model (a form of prompt engineering known as chain-of-thought prompting) help LLMs solve complex problems more accurately through the power of procedure.
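To make the idea concrete, here is a minimal sketch of few-shot chain-of-thought prompting: the prompt embeds worked examples whose answers spell out intermediate steps, nudging the model to reason step by step on the new question. The example problems are the sort used in the chain-of-thought literature; how the resulting prompt is sent to a model is left out.

```python
# Few-shot chain-of-thought prompt: each worked example shows its
# intermediate reasoning before stating the final answer.
FEW_SHOT_COT = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls.
5 + 6 = 11. The answer is 11.

Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A: The cafeteria started with 23 apples. After using 20, they had
23 - 20 = 3. Buying 6 more gives 3 + 6 = 9. The answer is 9.
"""

def build_cot_prompt(question: str) -> str:
    # Append the new question after the worked examples; the trailing
    # "A:" invites the model to continue with its own reasoning chain.
    return f"{FEW_SHOT_COT}\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "If there are 3 cars in the parking lot and 2 more arrive, "
    "how many cars are in the parking lot?"
)
```

Because the exemplars demonstrate the procedure rather than just the answer, the model is more likely to decompose the new problem the same way instead of guessing in one step.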
LLMs can also be “augmented” if they’re provided with “tools”. What exactly are tools? Researchers at Meta discuss the possibility of LLMs being provided with knowledge not stored in the model’s weights in order to solve more realistic problems. This knowledge can be retrieved or calculated using different modules, or tools, given to the model to use in tandem with its core abilities. This includes calling another fine-tuned model, retrieving information from a search engine or the internet (see LaMDA), solving computational problems via a code interpreter or calculator, and so on. Allowing these models to use this toolbox not only minimizes the error and hallucination they may suffer from, but also grants them capabilities beyond producing text. External tools, combined with methods like reinforcement learning from human feedback (RLHF), few-shot prompting, and others, can enable LLMs to learn how to use these tools appropriately on their own. With these augmentations, LLMs become more potent and capable. Frameworks like LangChain have already started to provide augmentations and tools to these models. As more projects are deployed on the LangChain framework, it feels like only a matter of time before we start seeing more augmented language models powering “second-order” applications. But we might need to pump the ethical brakes before jumping too far ahead.
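The core loop behind tool use can be illustrated with a framework-agnostic sketch (this is not LangChain’s actual API; the invocation format and tool names are hypothetical): the model’s output is parsed for a tool call, the tool runs outside the model, and its result can then be returned or fed back into the next prompt.

```python
# Hypothetical tool-dispatch loop: the model signals a tool call with a
# line like "CALL calculator: 37*41", and the host program executes it.
def calculator(expression: str) -> str:
    # Illustrative only; a real deployment would use a safe math parser
    # rather than eval(), even with builtins stripped.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def run_with_tools(model_output: str) -> str:
    """Route a tool invocation to the matching tool; otherwise treat
    the model's output as a final answer."""
    if model_output.startswith("CALL "):
        name, _, args = model_output[len("CALL "):].partition(": ")
        return TOOLS[name](args.strip())
    return model_output

result = run_with_tools("CALL calculator: 37*41")  # → "1517"
```

The point is that the arithmetic happens in a deterministic module rather than in the model’s weights, which is exactly how tool use sidesteps hallucination on computational tasks.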
LLMs and broader AI already fall under ethical scrutiny, and rightfully so. The safety of using these models is of paramount importance, especially when elevating their privileges with external tools. If these augmented models are to be depended on more heavily, they can’t hallucinate as often and need to be rigorously error-proofed. Providing tools to these models sets a further precedent of interaction with the outside world. Allowing a model to crawl the internet and, hypothetically, try to access sensitive and private information would be pretty unnerving. With respect to augmenting LLMs with tools, it’s critical to distinguish between passive tools, which only gather information, and active tools, which act on the world.
Specifically regarding the latter, acting on the physical or virtual world with no human oversight is a pretty alarming scenario that could lead to serious consequences. The tradeoffs between convenience and automation need to be carefully examined when augmenting these models. It’s important that humans aren’t abstracted completely out of the loop and retain some degree of oversight over these models’ actions, as opposed to full model autonomy. With the human-in-the-loop paradigm maintained, many of the ethical concerns associated with elevating these models’ privileges can be abated. To paraphrase everyone’s favorite comic book uncle, with great capability comes great responsibility.
Research is moving towards augmented LLMs that can engage with the real world, as opposed to being siloed off as a would-be tempest in a teapot of text. Augmenting LLMs philosophically aligns with their latent nature, as they gain a wider variety of capabilities at scale. Similarly, it stands to reason that giving them more power (and its required responsibility), in the form of tools, will let them perform more unique and multidimensional tasks beyond text generation, tasks that could leave a mark on the broader world.
While LLM capabilities and characteristics are still being unraveled and studied, companies have wasted no time in developing “second-order” applications. Instead of just generating text from a prompt and displaying the formatted response to the user, these applications go a step further and use the model’s response as an input for another action or sequence of actions, hence the term “second-order”. In the future, a platform like Copilot could automate macro programming tasks such as DevOps or complete full-stack development with minimal error. Let’s take a look at some of these applications!
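The “second-order” pattern can be sketched in a few lines (all names here are hypothetical stand-ins; `fake_llm` plays the role of a real model call): the model’s text response is not displayed, but parsed into a sequence of follow-up actions.

```python
# Second-order pattern: the model's response becomes input to the next
# step instead of being shown to the user directly.
def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; pretends the model answered
    # with a chain of shell steps.
    return "mkdir project && cd project"

def plan_actions(user_request: str) -> list[str]:
    """First order: generate text. Second order: parse that text
    into discrete actions a downstream system could execute."""
    response = fake_llm(f"Turn this request into shell steps: {user_request}")
    return [step.strip() for step in response.split("&&")]

actions = plan_actions("set up a new project directory")  # ["mkdir project", "cd project"]
```

In a real application, each parsed action would then be validated (ideally with a human in the loop, per the concerns above) before execution.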
Intelligent software agents such as Siri and Alexa have been around for some time now. However, they’ve been relatively limited in performing abstract tasks, sticking instead to very explicit, low-bandwidth requests. Adept AI is a startup focused on building AI-powered human-computer interfaces that can handle more high-bandwidth requests. Currently, they’ve shipped their own large transformer model called ACT-1. ACT-1, in its current product instantiation, is a Chrome extension that is able to perform high-level user requests in the browser. Suppose you want to find a house for your family within a budget, or you want to scan Facebook Marketplace for an item and message the seller. ACT-1 is able to perform these multi-step tasks by observing browser activity from the user, fine-tuning its workflow reasoning, and then executing said workflow. ACT-1 essentially functions as a universal copilot and is an early indicator of how powerful software agents can become.
LLMs can also be used to supercharge search. Previous iterations of search engines often returned subpar results, largely due to their keyword-based approach to queries. LLMs bring stronger context recognition, so search engines wielding them can better return content relevant to the intent of a query. Incumbents like Google and Bing are working on integrating (or, in the case of Bing, better integrating) LLMs as a companion to search in order to optimize their results. Newer search engines like Neeva are also leveraging LLMs to deliver personalized, authoritative results, with the added capability of eliciting real-time results. Enterprise search is also being powered by LLMs. Companies like Seek and Hebbia use LLM-powered search to sift through enterprise data so that anyone can quickly and painlessly get answers to specific questions about their data.
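The shift from keyword matching to context recognition boils down to retrieval by vector similarity. Here is a toy sketch: documents and the query are mapped to vectors and ranked by cosine similarity. The `embed` function below is a crude character-frequency stand-in for a real embedding model, which is the assumption doing all the work.

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in embedding: character frequencies over a-z. A real system
    # would call a learned embedding model here.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, docs: list[str]) -> list[str]:
    # Rank documents by similarity to the query vector, best first.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
```

With a real embedding model, semantically related documents land near the query vector even when they share no keywords, which is what lets LLM-era search match intent rather than exact terms.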
The “second-order” applications we’ve surveyed so far have been confined to the software domain. What about LLMs that can function as hardware interfaces? Google Research and its subsidiary Everyday Robots are working on integrating their PaLM language model into robots that execute routine manual human tasks. Robotics in its current state is programmed for a very narrow set of tasks. There’s a lot of unpredictability a human-helper robot may encounter, from its tasks to simply navigating its environment. With the PaLM model integration, known as PaLM-SayCan, Google is introducing a new layer in their reinforcement learning strategy by evaluating both how the model interprets the request AND how the robot executes it. It then uses this as feedback to better understand how to perform similar tasks in the future, as well as to better break down complex tasks using chain-of-thought reasoning, just like the augmented LLMs we discussed. With the integration of PaLM into Google’s Everyday Robots, there was a 14% improvement in the ability to map a viable approach to a task, a 13% improvement in the ability to successfully perform a task, as well as a 26% improvement in planning tasks consisting of eight or more steps.
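The “interprets the request AND executes it” pairing can be sketched in simplified form: the language model scores how useful each candidate skill is for the instruction, the robot’s value function scores how feasible that skill is in the current state, and the product picks the next action. The skill names and scores below are made up for illustration.

```python
# Simplified SayCan-style action selection: language usefulness times
# physical feasibility. All numbers here are invented.
llm_scores = {          # "does this help with the instruction?"
    "pick up sponge": 0.6,
    "go to table": 0.3,
    "pick up apple": 0.1,
}
affordance = {          # "can the robot actually do this right now?"
    "pick up sponge": 0.2,   # e.g. sponge is out of reach
    "go to table": 0.9,
    "pick up apple": 0.1,
}

def best_skill(llm_scores: dict, affordance: dict) -> str:
    # Combine both signals and pick the highest-scoring skill.
    combined = {s: llm_scores[s] * affordance[s] for s in llm_scores}
    return max(combined, key=combined.get)
```

Here “go to table” wins (0.3 × 0.9 = 0.27) over the language model’s top pick (0.6 × 0.2 = 0.12), illustrating how grounding in the robot’s actual capabilities reins in plans that sound right but aren’t feasible.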
LLMs are so much more than their capacity for text generation. The latent abilities unlocked (somewhat unintentionally) by scaling large language models to ever higher parameter counts may enable a new paradigm for human-computer interaction. By augmenting large language models with external tools, and discovering creative new ways to use them, researchers and companies are developing LLM use cases that serve users well beyond pure text generation. To gaze even further into the future, could these “second-order” applications of LLMs result in even more abstract, complex “third-order” applications?
Artificial general intelligence (AGI) may still seem like a pipe dream, but the needle keeps moving with constant breakthroughs in state-of-the-art AI. With the progress of technology like software AI agents and multimodal foundation models, and the emergence of exciting (but still incredibly nascent) technology like AutoGPT, which allows for a higher degree of human abstraction, the heretofore far-out prospect of “true” AGI feels that much closer. Needless to say, the advancement of large language models will continue to generate more complex applications powered by their progress. The “second-order” applications are already being built, so it’s easy to speculate about the tertiary class of applications that will derive from them. Regardless, exciting times are ahead for foundational large language models and the applications they’ll yield!