We have developed a system that can infer and model human actions on computer applications and perform them reliably and quickly, and that is well suited for deployment in various AI assistants and operating systems. Our system is called the Large Action Model (LAM). Enabled by recent advances in neuro-symbolic programming, the LAM directly models the structure of various applications and the user actions performed on them, without a transitory representation such as text. The LAM system achieves results competitive with state-of-the-art approaches in terms of accuracy, interpretability, and speed. Engineering the LAM architecture involves overcoming both research challenges and engineering complexities, from real-time communication to virtual network computing technologies. We hope that our efforts can help shape the next generation of natural-language-driven consumer experiences.
background
The proliferation of personalized experiences delivered via mobile applications has been a dominant theme of the last decade of personal computing. Enabled by a graphical user interface, these applications allow interaction without any programming experience. Under such a scheme, service providers enhance hardware with various software add-ons, while hardware manufacturers enable new types of applications through new features, forming a symbiotic relationship.
The advent of neural language models, a breakthrough in natural language processing, has enabled new capabilities in machine comprehension of a wide range of natural language-related problems in daily life [11], [12]. With the equally rapid commodification in speech recognition [13] and synthesis [14], where cost and latency have dropped below the real-time barrier while maintaining fidelity, we can already build machines that understand human intentions with sufficient depth, nuance, and context, with realistic effort and cost.
Enabled by such progress, a new kind of user interaction experience arises where a device's primary user interface is spoken natural language instead of touch [A]. The process started a decade ago in smart speakers (Alexa [15], Google Home, Raven, DuerOS), and has recently accelerated with AI chatbots (ChatGPT, Claude) and operating systems with a natural language-based user interface (rabbit). Designing this new type of device comes with a series of previously unforeseen challenges, the most significant of which is the unavailability of application programming interfaces (APIs) for major service providers [B]. To address this issue, we take advantage of neuro-symbolic programming to directly learn user interactions with applications, bypassing the need to translate natural language user requests into rigid APIs. To ensure a seamless on-device user experience, we face a broader range of engineering challenges involving real-time communication, virtual network computing, and web automation. We characterize our system as a Large Action Model, or LAM, emphasizing our commitment to better understand actions, specifically human intentions expressed through actions on computers, and, by extension, in the physical world. We believe our efforts can contribute to the further embodiment of intelligent algorithms that positively benefit our daily lives.
“the Large Action Model, or LAM, emphasizes our commitment to better understand actions, specifically human intentions expressed through actions on computers and, by extension, in the physical world.”
why a new neuro-symbolic model?
Our key observation is that the inherent structure of human-computer interactions differs from that of natural language or vision. Applications are expressed in a form that is more structured than a rasterized image, and more verbose and noisy than a sentence or paragraph. The characteristics we desire from a LAM are also different from those of a foundation model that understands language or vision alone: while we may want an intelligent chatbot to be creative, LAM-learned actions on applications should be highly regular, minimalistic (per Occam's razor), stable, and explainable.
language models are ill-equipped to comprehend applications with raw text
We measure the tokens required to represent common web applications across different snapshots in raw HTML. State-of-the-art large language models, with their existing tokenizers, have trouble fitting the raw-text application representation within their context window.
[Graphic: “Airbnb” and “YouTube”]
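A rough sketch of this kind of measurement is shown below. It assumes a crude regex-based token estimate rather than any real model's tokenizer, so the counts are only indicative. The `estimate_tokens` and `fits_context` helpers, the sample markup, and the 8,192-token window are all illustrative assumptions, not the setup used for the measurement above.

```python
import re

# Crude stand-in for a subword tokenizer: runs of word characters and
# individual punctuation marks each count as one token. Real tokenizers
# differ, so treat the result as a rough estimate only.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def estimate_tokens(html: str) -> int:
    """Approximate the number of tokens needed to represent raw HTML."""
    return len(TOKEN_PATTERN.findall(html))


def fits_context(html: str, context_window: int) -> bool:
    """Check whether the estimated token count fits a given window."""
    return estimate_tokens(html) <= context_window


# Hypothetical snapshot: one listing card repeated many times, loosely
# imitating the repetitive markup of a search results page.
card = '<div class="card"><a href="/rooms/1"><span>Cozy loft</span></a></div>'
snapshot = "<html><body>" + card * 2000 + "</body></html>"

tokens = estimate_tokens(snapshot)
print(tokens, fits_context(snapshot, context_window=8192))
```

Even this small, repetitive page exceeds the assumed window, mostly because markup punctuation and attribute names dominate the count rather than visible content.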
Neural language models, by design, are not well suited to perform these tasks alone. Although they have shown the ability to understand and utilize application programming interfaces [16], user interfaces are very different and fundamentally less compatible with text. This implies that any neural language model operating a user-oriented interface would require a preprocessing step to transform such applications, and the actions performed on them, into a transitory representation of either raw text, rasterized images, or some tokenized sequence. Then, some form of reasoning would be performed using either test-time-adaptive prompt templates [23], instruction-driven [8], or reinforcement-learning-based fine-tuning [6]. Language models here would need to serve as end-to-end (action) reasoners, a task they still struggle to perform well [17]. This approach has several additional disadvantages: a tokenized sequence or a pixel array discards important structural information contained in applications [25], is often too long and noisy even for the most powerful and context-aware language models [24], and introduces vagueness on actions described in natural language [23]. Existing works and benchmarks either fail to provide promising results on real-life, complex applications [22] or completely rely on users to design and provide actions.
Concurrently, there is a body of literature in the field of web automation and robotic process automation that involves the use of symbolic algorithms [2], [7], [19]. These methods have shown promising results in narrow domains and have been implemented in various products, ranging from productivity software [9] to computer-aided design [20]. The main challenge in these applications is the difficulty in generalization. Often, a specialized formal specification needs to be designed for a specific problem, and the heuristics used to solve one problem do not translate well to others.
Both lines of work point toward a hybrid approach that combines a neural component with a symbolic component. Although this field is still in its early stages of development, the combination is attractive for several reasons:
- We can define and model complex application structures, beyond simple token sequences, from first principles. They are compatible with both a symbolic algorithm and a neural network, where actions and action sequences are first-class citizens.
- We can achieve explainability, fast inference, and simplicity of a heuristic (which performs actions to satisfy user intentions) by leveraging a symbolic algorithm [2].
- We can benefit from the language, vision, and zero-shot reasoning capabilities of a neural network [10], which improve as more data becomes available [3].
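To make the first point more concrete, here is a minimal sketch of an application structure that keeps its hierarchy and treats actions as plain data. The `UINode` and `Action` types and the `resolve` helper are hypothetical names chosen for illustration; they are not LAM's actual formalization.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class UINode:
    """One element of a hypothetical application structure (a tree here)."""
    role: str                       # e.g. "form", "textbox", "button"
    label: str = ""
    children: List["UINode"] = field(default_factory=list)

    def walk(self) -> Iterator["UINode"]:
        # Depth-first traversal that follows the parent/child structure.
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass(frozen=True)
class Action:
    """An action is plain data, so sequences can be stored and inspected."""
    kind: str                       # "click" or "type"
    role: str
    label: str
    text: Optional[str] = None


def resolve(root: UINode, action: Action) -> Optional[UINode]:
    """Find the node an action refers to, or None if it is absent."""
    for node in root.walk():
        if (node.role, node.label) == (action.role, action.label):
            return node
    return None


page = UINode("form", "search", [
    UINode("textbox", "destination"),
    UINode("button", "search"),
])
plan = [
    Action("type", "textbox", "destination", text="Lisbon"),
    Action("click", "button", "search"),
]
for step in plan:
    target = resolve(page, step)
    print(step.kind, "->", target.role if target else "not found")
```

Because both the structure and the actions are ordinary typed values, a symbolic procedure can search over them directly, and a neural model could in principle consume the same objects as features.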
We are optimistic that by investing in research efforts and computational resources, aided by similar preceding works in respective fields, this line of neuro-symbolic research could yield fruitful results. These fresh perspectives allow us to develop unique formulations and models that are surprisingly effective on the benchmarks we care about.
To assist with the new model, we designed the technical stack from the ground up, from the data collection platform to a new network architecture that utilizes both transformer-style attention and graph-based message passing, combined with program synthesizers that are demonstration and example-guided [21].
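For readers unfamiliar with the two ingredients named above, the following toy sketch shows one round of graph message passing followed by scaled dot-product attention, in plain Python. It is a generic illustration under simplifying assumptions (mean aggregation, no learned weights, hand-picked two-dimensional features) and does not describe the LAM network itself; the function names and data are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def message_pass(features: Dict[str, Vector],
                 edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One round: each node averages its own vector with its neighbours'."""
    neighbours = {name: [name] for name in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dim = len(next(iter(features.values())))
    return {
        name: [sum(features[m][i] for m in group) / len(group)
               for i in range(dim)]
        for name, group in neighbours.items()
    }


def attend(query: Vector, features: Dict[str, Vector]) -> Dict[str, float]:
    """Softmax of scaled dot products between one query and every node."""
    scale = math.sqrt(len(query))
    scores = {name: sum(q * v for q, v in zip(query, vec)) / scale
              for name, vec in features.items()}
    peak = max(scores.values())      # subtract the max for numerical stability
    exps = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical three-node interface graph with hand-picked features.
features = {"form": [0.0, 0.0], "input": [1.0, 0.0], "button": [0.0, 1.0]}
edges = [("form", "input"), ("form", "button")]

updated = message_pass(features, edges)
weights = attend([0.0, 1.0], updated)
print(max(weights, key=weights.get))
```

Message passing lets each element absorb information from its structural neighbours, while attention lets a query weigh all elements at once regardless of distance in the graph.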
learning actions, by demonstration
LAM's modeling approach is rooted in imitation, or learning by demonstration: it observes a human using the interface and aims to reliably replicate the process, even if the interface is presented differently or slightly changed. Instead of having a black-box model uncontrollably outputting actions and adapting to the application during inference, LAM's "recipe" is more observable. This means that once the demonstration is provided, the synthesized routine runs directly on the target application without the need for a busy loop of "observation" or "thoughts," and any technically trained human should be able to inspect the "recipe" and reason about its inner workings. As LAM accumulates knowledge from demonstrations over time, it gains a deep understanding of every aspect of an interface exposed by an application and creates a "conceptual blueprint" of the underlying service provided by the application. LAM can be seen as a bridge, connecting users to these services through the application's interface.
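The idea of an inspectable "recipe" can be illustrated with a deliberately simplified sketch. It assumes a demonstration is a list of recorded events and that steps are keyed on an element's role and label rather than its position, so the routine still works when the layout changes. The event format and the `synthesize_recipe`, `find`, and `replay` helpers are hypothetical, and real program synthesis is far more involved than this direct mapping.

```python
from typing import Dict, List, Optional

Element = Dict[str, str]   # hypothetical flat record for one UI element


def synthesize_recipe(demo: List[dict]) -> List[dict]:
    """Turn recorded events into steps keyed on role and label, not position."""
    return [
        {"kind": event["kind"],
         "role": event["target"]["role"],
         "label": event["target"]["label"],
         "text": event.get("text")}
        for event in demo
    ]


def find(page: List[Element], role: str, label: str) -> Optional[Element]:
    """Locate an element by role and label, wherever it sits on the page."""
    for element in page:
        if element["role"] == role and element["label"] == label:
            return element
    return None


def replay(recipe: List[dict], page: List[Element]) -> List[str]:
    """Run each step directly; fail loudly if a target cannot be found."""
    log = []
    for step in recipe:
        element = find(page, step["role"], step["label"])
        if element is None:
            raise LookupError(f"no {step['role']} labelled {step['label']!r}")
        if step["kind"] == "type":
            element["value"] = step["text"]
            log.append(f"type {step['text']!r} into {step['label']}")
        else:
            log.append(f"click {step['label']}")
    return log


# Demonstration recorded on one layout (positions are deliberately ignored).
demo = [
    {"kind": "type", "text": "Lisbon",
     "target": {"role": "textbox", "label": "destination", "index": "0"}},
    {"kind": "click",
     "target": {"role": "button", "label": "search", "index": "1"}},
]
# The same interface after a redesign: new banner, reordered elements.
changed_page = [
    {"role": "banner", "label": "promo"},
    {"role": "button", "label": "search"},
    {"role": "textbox", "label": "destination"},
]

recipe = synthesize_recipe(demo)
log = replay(recipe, changed_page)
print(log)
```

The recipe is a short list of readable steps, so a person can inspect it before it runs, and replay involves no model calls at all.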
We believe that in the long run, LAM exhibits its own version of "scaling laws," [3] where the actions it learns can generalize to applications of all kinds, even generative ones. Over time, LAM could become increasingly helpful in solving complex problems spanning multiple apps that require professional skills to operate.
By utilizing neuro-symbolic techniques in the loop, LAM sits at the frontier of interdisciplinary scientific research in language modeling (LM), programming languages (PL), and formal methods (FM). Traditionally, the PL/FM community has focused on symbolic techniques: solver technologies that rely on logical principles of induction, deduction, and heuristic search. While these symbolic techniques can be highly explainable and come with strong guarantees, they suffer from a scalability limit. By contrast, recent innovations in the LM community are grounded in machine learning and neural techniques: while highly scalable, they suffer from a lack of explainability and come with no guarantees about the output produced. Inspired by the success of machine learning and neural techniques, the PL/FM community has recently made waves of progress on neuro-symbolic methods: by putting together neural techniques (such as LLMs) and symbolic ones, one combines the best parts of both worlds, making the creation of scalable and explainable learning agents feasible. Yet to date, no one has put cutting-edge neuro-symbolic techniques into production; LAM seeks to pioneer this direction.
performing actions responsibly & reliably
LAM does not live in a vacuum. To efficiently leverage the power of LAM to execute tasks on behalf of users on dedicated hardware, we designed new platforms to schedule and manage LAM-powered routines. In addition, LAM interacts with applications designed for humans. This means that knowing only how to achieve an objective is not sufficient: LAM and its peripheral software need to know how to do it in a humanizing, respectful way. Beyond our core model development, several key elements support our system:
- Cloud infrastructure that spins up environments in which the AI can behave like a human, because most software requires its user to behave like one, however the software defines and determines that. We have built a special cluster of virtualized environments that can run LAM on consumer applications, whether in testing or production. It provides an advanced level of security and scalability, enabling us to rapidly prototype our foundation model research.
- Hardware-software programming interfaces that deliver the multimedia experience of AI-human cooperation, because simply bundling existing protocols together is very poorly optimized. We have created our own optimizations for protocols used in multimedia interactions between our users and our operating system, as well as Virtual Network Computing (VNC) protocols for both users assisting rabbits in performing sensitive operations like authentication or payment, and for rabbits learning from user demonstrations.
- A unified standard to test, observe, and iterate on the AI along with the product, because LAM needs a "gym" to continue to learn and adapt, and requires observations from product usage and external assistance. Our unique formalization of web, desktop, and mobile application structures, along with the actions performed on them, has enabled us to effectively utilize internet-scale scraped data and human feedback to train our models, all while requiring relatively low computational power.
This process highlights our dedication to the responsible deployment of LAM: guardrails keep its behaviors safe, efficient, and indistinguishable from human behavior, making it a comfortable choice for delegating user interactions.
early signs of LAM competitiveness in web navigation tasks
Web environments, along with mobile and desktop environments, are of interest to LAM. Additionally, the task of web navigation assesses a subset of actions that we care about. Although recent web navigation algorithms have shown human-level performance in a simulated environment (MiniWoB++), they still struggle on real websites. When tested on the Mind2Web benchmark dataset [26], the most effective method achieves an accuracy of only 70.8% in locating the target element. Since the latency is not reported in the respective works, we have omitted the column.
We provide a preliminary evaluation of LAM using our internal benchmark. The dataset comprises 283 episodes, consisting of 17 tasks gathered from 14 different real-world websites, including Airbnb, Google Flights, Shein, Spotify, and others. We have evaluated both a purely neural approach and a neuro-symbolic approach. While the purely neural method demonstrates competitive performance in locating the target element, incorporating the symbolic method significantly improves accuracy and latency.
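For reference, the accuracy of locating the target element used in the web navigation evaluation can be read as a simple per-step ratio. The sketch below assumes each step is a (predicted element id, target element id) pair; the function name and the sample ids are hypothetical and are not drawn from our benchmark.

```python
from typing import List, Tuple


def element_accuracy(steps: List[Tuple[str, str]]) -> float:
    """Fraction of steps where the predicted element id equals the target."""
    if not steps:
        raise ValueError("no steps to score")
    hits = sum(1 for predicted, target in steps if predicted == target)
    return hits / len(steps)


# Hypothetical episode: four steps, one of which picks the wrong element.
episode = [
    ("search-box", "search-box"),
    ("date-picker", "guest-count"),
    ("submit", "submit"),
    ("first-result", "first-result"),
]
print(f"{element_accuracy(episode):.1%}")
```

Latency would be measured separately, for example as wall-clock time per step, since a method can locate elements accurately and still be too slow for interactive use.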
embodiment in AI-native devices
We believe that intelligence in the hands of the end-user is achievable without heavy client-side computing power. By carefully and securely offloading the majority of computation to data centers, we open up opportunities for ample performance and cost optimizations, making cutting-edge interactive AI experiences extremely affordable. While the neuro-symbolic LAM runs on the cloud, the hardware device interfacing with it does not require expensive and bulky processors, is extremely environmentally friendly, and consumes little power. As the workloads related to LAM continue to consolidate, we envision a path towards purposefully built server-side and edge chips.
outlook
We share the view that the scaling law continues to permeate all aspects of neural systems research, from vision over the past decade [4] to natural language now. We hope to continue this trend with our action model by collecting more data on human actions on computers and designing more scalable architectures for them. Over the long run, we believe that AI will fundamentally transform economically meaningful work and will play an important part in reshaping our infrastructure, starting with the increasingly nontrivial cost of developing and training them. We have confidence in the vast benefits that these large systems may provide in relation to the resources needed to develop them [1]. Understanding actions will in turn help redesign human-machine interactions and can result in more intuitive and useful natural language-powered systems and devices, benefiting all stakeholders, from service providers, consumers, and hardware manufacturers, to software developers.
footnotes
- [A] A plausible use case is operating applications with natural language while driving.
- [B] The main issue with the lack of available APIs is mostly non-technical. Some, like music streaming services, are contractually not allowed to provide a self-serve, non-bespoke API due to licensing agreements with an external party. Others face resource constraints, degraded user experience, long-term strategic uncertainty, and security risks, and will exhibit high business development friction and poor unit economics when negotiating a dedicated API.
reference
[1] R. Sutton, “The bitter lesson,” Mar. 13, 2019.
[2] S. Barman, “Ringer: web automation by demonstration,” Sigplan Notices, vol. 51, no. 10, pp. 748–764, Oct. 2016, doi: 10.1145/3022671.2984020.
[3] J. Kaplan, “Scaling laws for neural language models,” arXiv.org, Jan. 23, 2020.
[4] X. Zhai, “Scaling vision transformers,” arXiv.org, Jun. 08, 2021.
[5] T. Schick, “Toolformer: Language models can teach themselves to use tools,” arXiv.org, Feb. 09, 2023.
[6] E. Z. Liu, “Reinforcement learning on web interfaces using Workflow-Guided exploration,” arXiv.org, Feb. 24, 2018.
[7] Q. Chen, “Web Question Answering with Neurosymbolic Program Synthesis,” arXiv.org, Apr. 14, 2021.
[8] I. Gur, “A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis,” arXiv.org, Jul. 24, 2023.
[9] S. Gulwani, “Automating string processing in spreadsheets using input-output examples,” 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Jan. 2011, doi: 10.1145/1926385.1926423.
[10] OpenAI, “GPT-4V(ision) system card.”
[11] M. R. Morris, “Levels of AGI: Operationalizing progress on the path to AGI,” arXiv.org, Nov. 04, 2023.
[12] T. B. Brown, “Language Models are Few-Shot Learners,” arXiv.org, May 28, 2020.
[13] A. Radford, “Robust speech recognition via Large-Scale Weak Supervision,” arXiv.org, Dec. 06, 2022.
[14] J. Shen, “Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions,” arXiv.org, Dec. 16, 2017.
[15] Amazon Web Services, “AWS re:Invent 2015 | (MBL310) Alexa Voice Service Under the Hood,” YouTube, Oct. 12, 2015.
[16] OpenAI, “ChatGPT plugins.”
[17] J. Huang, “Large language models cannot Self-Correct reasoning yet,” arXiv.org, Oct. 03, 2023.
[18] S. G. Patil, “Gorilla: Large Language Model Connected with Massive APIs,” arXiv.org, May 24, 2023.
[19] M. Raza, “Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference,” 2020 ACM SIGMOD International Conference on Management of Data, May 2020, doi: 10.1145/3318464.3380608.
[20] J. Feser, “Metric Program synthesis,” arXiv.org, Jun. 13, 2022.
[21] A. Solar-Lezama, “The sketching approach to program synthesis,” in Lecture Notes in Computer Science, 2009, pp. 4–13. doi: 10.1007/978-3-642-10672-9_3.
[23] P. Sodhi, “HeaP: Hierarchical Policies for Web Actions using LLMs,” arXiv.org, Oct. 05, 2023.
[24] I. Gur, “Understanding HTML with Large Language Models,” arXiv.org, Oct. 08, 2022.
[25] S. Geisler, “Transformers meet directed graphs,” arXiv.org, Jan. 31, 2023.
[26] X. Deng, “Mind2Web: towards a generalist agent for the web,” arXiv.org, Jun. 09, 2023.
[27] H. Furuta, “Multimodal Web Navigation with Instruction-Finetuned Foundation Models,” arXiv.org, May 19, 2023.
[28] L. Zheng, “Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control,” arXiv.org, Jun. 13, 2023.
citation
if you find the work useful in your research, please consider citing:
@misc{rabbit2023,
author = {rabbit research team},
title = {Learning human actions on computer applications},
url = {https://rabbit.tech/research},
year = {2023}
}