Chat with UAV -- Human-UAV Interaction Based on Large Language Models
Haoran Wang 1,4, Zhuohang Chen 1,2, Guang Li 1,2, Bo Ma 2*, Chuanhuang Li 3

1 School of Engineering and Informatics, University of Sussex, Falmer, Brighton, BN1 9RH, England.
2,3 School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou, 310018, Zhejiang, China.
4 Department of Electrical and Computer Engineering, University of Auckland, Auckland, 1142, New Zealand.
*Corresponding author(s). E-mail(s): mabo@mail.zjgsu.edu.cn

Abstract

The future of UAV interaction systems is evolving from engineer-driven to user-driven, aiming to replace traditional predefined Human-UAV Interaction designs. This shift focuses on enabling more personalized task planning and design, thereby achieving a higher quality of interaction experience and greater flexibility, which can be used in many fields, such as agriculture, aerial photography, logistics, and environmental monitoring. However, due to the lack of a common language between users and UAVs, such interactions are often difficult to achieve. Large Language Models possess the ability to understand natural language and robots' (UAVs') behaviors, marking the possibility of personalized Human-UAV Interaction. Recently, some HUI frameworks based on LLMs have been proposed, but they commonly suffer from difficulties in mixed task planning and execution, leading to low adaptability in complex scenarios. In this paper, we propose a novel dual-agent HUI framework. This framework constructs two independent LLM agents (a task planning agent and an execution agent) and applies different prompt engineering to separately handle the understanding, planning, and execution of tasks. To verify the effectiveness and performance of the framework, we built a task database covering four typical application scenarios of UAVs and quantified the performance of the HUI framework using three independent metrics. Meanwhile, different LLM models are selected to control the UAVs and their performance is compared. Our user study and experimental results demonstrate that the framework improves the smoothness of HUI and the flexibility of task execution in the task scenarios we set up, effectively meeting users' personalized needs.

Keywords: Prompt Engineering, UAV, LLMs, Agent, Human-UAV Interaction

1 Introduction

With the increasing popularity of Unmanned Aerial Vehicles (UAVs) in modern society, the complexity of Human-UAV Interaction (HUI) is also escalating. According to research by Research and Markets, the world's largest market research organization, the global UAV market will reach a value of up to €53.1 billion by 2025 [1]. The Civil Aviation Administration of China published [2] that the number of registered UAVs in China had reached 1.2627 million by the end of 2023, representing a 32.2% increase from 2022. Clearly, the rapid proliferation of UAVs in human society has made the need to lower the threshold for HUI even more urgent. It is becoming more important to explore how to control UAVs through diversified user interfaces and interaction mechanisms, and to conduct in-depth research on more complex interaction designs from the perspectives of personalization, privacy protection, naturalness of interaction, security, and humanistic care [3].
This trend underscores the increasingly refined nature of HUI. Traditional control schemes are designed to meet the predefined, fixed HUI or task planning requirements set by engineers. For example, papers [4, 5] used neural networks to recognize human gestures for the purpose of enabling HUI. Rajappa et al. [6] obtained human-transmitted signals by applying external forces to the drone and allowing its sensors to recognize the applied forces. In widely used agricultural UAVs, each task is executed based on task routes and parameters set by engineers, which are suitable for static scenarios. However, such control schemes have limited capabilities in handling complex tasks and often require human intervention to meet demands. To address reliability and system manageability, classical robotics research has developed three-layer architectures [7-9], which decompose capabilities hierarchically: a task-planning layer formulates goals, a behavior-control layer sequences predefined actions, and an execution layer interfaces with hardware. These architectures perform stably in domains such as UAV safety [9] and physical HUI [7]. However, their task decomposition relies on hard-coded logic such as state machines, limiting their adaptability. For example, when a user requests "avoid birds and inspect roof cracks," such architectures would typically require manually redefining state transitions and updating control graphs, an inflexible and labor-intensive process. To improve modularity and reusability, skill-based architectures [6, 10] encapsulate capabilities into callable modules. These frameworks support task composition by assembling existing skills, such as Dong et al.'s marine disturbance controller [10]. Nevertheless, they often demand retraining of parametric models to accommodate new skills, leading to scalability and generalization challenges. Therefore, future HUI frameworks are evolving toward user-driven, personalized task directions, aiming to provide high-quality HUI experiences and enhance the flexibility of task planning [11, 12]. Nevertheless, achieving user-driven personalized goals through traditional methods is challenging due to the lack of a common language between users and UAVs. This absence of a common language manifests in two aspects. Firstly, due to the complexity and abstraction of programming languages, humans often find it difficult to model and solve real-world problems using code, especially non-programmers. Thus, even if users understand the essence of a problem, they may struggle to translate it into a format that UAVs or other automated frameworks can execute. Secondly, humans cannot directly communicate with algorithms, machines, or programs, as these operate based on logical rules and data inputs rather than natural language or intuitive understanding. Consequently, users may not be able to convey specific instructions or requests in a way that UAVs can understand and respond to, which can lead to misunderstandings, errors, and ultimately a suboptimal level of personalization and effectiveness in HUI. Fortunately, with the emergence of ChatGPT, Large Language Models (LLMs) have developed rapidly, and the advent of Llama [13] has made it a reality for individuals to deploy LLMs locally. The emergence of LLMs points to a solution: a common language, or a swift translation layer, between humans and machines.
Trained on massive amounts of textual data, LLMs possess powerful semantic understanding capabilities, enabling them to deeply comprehend the syntax, semantics, and context of natural language. At the same time, the syntactic standards, logical consistency, and abstraction and modularization features of code allow LLMs to excel in understanding and generating code. These two characteristics make LLMs capable of serving as a direct bridge between users and UAVs, marking the possibility for users to interact directly with UAVs beyond the predefined settings of engineers. Recently, many HUI frameworks based on LLMs have been proposed to realize this vision: for example, using ChatGPT directly for task planning [14] or integrating LLMs into the UAV control process through a Python pipeline [15], leveraging LLMs to abstract code for users and thereby enabling them to organize personalized tasks more freely. Liu et al. [16] and Sun [17] have integrated LLMs into the visual inspection of infrastructure in order to harness the ability of LLMs to comprehend human intentions and generate control commands. However, these single-agent LLM methods still encounter planning failures when faced with complex planning challenges. The issues leading to such failures lie in:

1. Task Planning: When single-agent LLM methods engage in task planning, they may erroneously integrate code or execution scripts into the plan, driven by the dual necessity of planning and execution, as illustrated in Figure 1. This incorporation introduces errors, as the plan ideally should abstract away from specific code or execution intricacies. In the context of task planning, LLMs are expected to concentrate on comprehending the task and its context, akin to a 'human' role. Conversely, during task execution, LLMs need to generate precise code or machine-readable instructions, functioning more like a 'script'. In practical scenarios, achieving a seamless transition between these 'human' and 'script' roles within a single LLM run presents a significant challenge. Consequently, when confronted with intricate planning and tasks, single-LLM-agent methods may encounter reduced planning efficiency or execution failures.

2. Task Execution: Furthermore, when using LLMs for task planning or executing independent tasks, if the complexity of the task exceeds the capabilities of the predefined high-level function library, the existing HUI frameworks based on LLMs [15] may fail to execute the task, especially for complex operations like obstacle avoidance. For UAVs, obstacle avoidance involves an intricate process, including receiving camera data, inferring surrounding obstacles, and calculating subsequent velocity vectors, which cannot be simplified into a script. Therefore, in such cases, the interaction framework must possess the ability to invoke tools to complete the task. Similarly, for task planning, when the complexity of the task surpasses the solo handling capacity of LLMs, the framework should also have the capability to call upon tools.

To address the aforementioned issues, we propose "UAV-GPT," a dual-agent framework tailored for UAV interaction. It integrates the natural language understanding capabilities of LLMs via API interfaces, enabling UAVs to "understand" human language and learn to invoke tools to tackle complex task planning and execution challenges.
The framework consists of two LLM agents: a planning agent and an execution agent. To solve the task planning problem, the planning agent utilizes the natural language understanding abilities of LLMs to classify and plan tasks, accurately categorizing user needs and formulating reasonable execution plans through a predefined behavior library and discrete task reorganization. It then conveys the classified and planned tasks to the execution agent. To solve the task execution problem, the execution agent selects the most suitable method and executes the corresponding operational code based on the mapping relationship between a predefined command library and the LLM, ensuring precise and efficient execution of every instruction. To enhance generalization, we integrated ROS-based control algorithms that fall within the scope of the skill-based architecture; the difference is that we use skills as a supplement to the robotics framework rather than as its foundation. Meanwhile, within the classical three-layer architecture, our planning and execution agents operate within the scope of the task-planning and execution layers, respectively. Crucially, the integration of LLMs significantly optimizes these layers compared to traditional implementations: it empowers the planning layer with semantic reasoning to transcend hard-coded logic, and upgrades the execution layer with code comprehension to enable dynamic rather than static tool invocation. This ensures that, unlike purely skill-based or traditional architectures, our system achieves better flexibility and adaptability.

Fig. 1 Task decomposition divides a complex task into planning and execution phases, which differ for LLMs. Using one LLM agent for both may cause errors, incorporate execution details into the plan, or, conversely, insert planning elements into execution details.

In summary, our contributions are as follows:
• We propose a dual-agent intelligent interaction framework based on LLMs for UAVs, which achieves precise classification of tasks and outputs solutions with high success rates and efficiency. The planning agent employs a predefined behavior library and discrete task reorganization to plan complex tasks reasonably, and the execution agent converts the solutions into precise machine language outputs.
• We introduced traditional control algorithms to the execution agent to enhance its range of applicability. The execution agent evaluates tasks through the mapping relationship between a predefined command library and LLMs, and then selects a reasonable execution method to complete the tasks.
• Our user study demonstrates that the framework improves HUI smoothness and task-execution flexibility, enabling better support for personalized user needs.
• The simulation and real-world experimental results demonstrate that, compared to an HUI framework with a single LLM agent, our framework achieves an average improvement of 60% in operational efficiency for complex tasks and a 30% increase in task execution success rate.

In the overview of Section 3, we first briefly explain the limitations of traditional HUI for UAVs, then describe the specific architectural approach of UAV-GPT for the HUI scheme, where the interaction requirements are obtained from users through simulated requests. In Section 3.1, we construct a database related to daily interaction tasks with UAVs.
We define four task categories for the database and provide detailed explanations for the classification rationale. Within each task category, we design 20 different tasks that range from simple movements to continuous complex route planning and optimal solutions. Simultaneously, we quantify the performance of different HUI frameworks on this database using three indicators, which allows for a clearer demonstration of the advantages of our framework. In Sections 3.2 and 3.3, we provide detailed explanations of the methods at both agents of the interaction framework. The overall architecture is primarily implemented using ERNIE 4.0 and evaluated through performance analysis.

2 Related work

2.1 Traditional Human-Robot Interaction Methods

Physical Human-Robot Interaction (PHRI) refers to direct and intuitive communication between humans and robots without intermediary mediums, effectively conveying rich tactile feedback to humans. In the early stages of robotics, due to the lack of tactile sensing capabilities, traditional interaction methods relied on wearable devices. For example, Kim et al. [18] developed an exoskeleton master arm capable of detecting torque applied by users and providing multimodal contact feedback, enabling device mobility. Another example is the vibrating-motor-controlled bracelet proposed in [19], which guides robots to follow human trajectories. Haddadin et al. [7] summarized the latest advancements in PHRI, while De Santis et al. [8] focused on addressing the safety aspects of PHRI. However, these interaction methods face a common limitation: the interaction distance is typically fixed and confined to a small range. Moreover, the forms of robots capable of interaction are limited, restricting effective interaction with non-anthropomorphic robots such as UAVs. To address this challenge, researchers have proposed "Teleoperation Human-Robot Interaction (THRI)." In the following, we provide an overview of this area.

Teleoperation Human-Robot Interaction leverages the concept of indirect control, where commands are transmitted through wireless communication channels. This significantly enhances the effective distance range in human-machine interaction. For example, Peppoloni et al. [20] designed a ROS-integrated interface that enables users to remotely control robots via hand gestures. Similarly, Tsetserukou et al. [21] proposed a method for teleoperating robotic arms using full-body motion. Su et al. [22] improved traditional teleoperation by combining 2D image transmission with a 3D virtual interface, enhancing spatial awareness during remote control. Fani et al. [23] explored wearable-based interaction, offering a simpler and more intuitive experience that improves responsiveness. These studies have greatly extended the restricted interaction distances and changed the forms of robotic agents at the execution end in physical human-robot interaction. However, they also introduced new challenges. When a user engages in remote control of a robot, it is imperative that the robot operates within a size or scale that is commensurate with the user's environment, or adheres to predefined system scales. However, the inherent variability in environmental dimensions precludes an exact correspondence, thereby necessitating user reliance on these predetermined scales.
This reliance, in turn, imposes limitations on the efficacy of remote control in practical, real-world scenarios. Hence, a pertinent question arises: can the integration of advanced technologies, specifically LLM-based Human-Robot Interaction methods, empower the execution end (i.e., the robot) with the capacity to interpret and execute commands in a manner that facilitates more personalized responses tailored to user-specific requirements? This exploration aims to transcend the limitations imposed by reliance on predefined scales and enhance the effectiveness of remote control in real-world applications.

Supervisory and Interactive Systems in UAV Operations: Supervisory control architectures enable strategic human oversight in UAV operations through dynamic task reconfiguration and hierarchical monitoring. Wang et al. [24] mitigate adversarial environmental uncertainties by formalizing swarm task planning as a Constrained Markov Decision Process (CMDP). Their performance-function-guided SAC-Lagrangian algorithm incorporates safety boundaries (e.g., no-fly zones) into optimization objectives, achieving 96% mission success rates under electronic warfare conditions. Dong et al. [10] developed a bio-inspired sliding mode controller with radial basis function neural networks (RBFNN) for marine rescue drones, using a Multi-Layer Perceptron (MLP) to compensate for ocean disturbances and thruster faults. This approach ensures trajectory stability under wind and wave perturbations while reducing computational complexity, validated through Lyapunov stability proofs. As for ground station design, standardization and modularity are critical for scalable UAV command interfaces. K. P. Arnold et al. [25] primarily examine the types, components, safety features, redundancy design, and future applications of UAV Ground Control Stations (GCS). E. Çintaş et al. [26] primarily examine how to achieve visual tracking of moving targets using UAVs equipped with low-cost hardware and a novel GCS. The core objective of this system is to develop a visual tracking system capable of operating efficiently in real-time applications, particularly within complex environments featuring non-fixed perspectives and dynamic target motion.

Mission planning integrates path optimization with threat mitigation. Xiong et al. [27] combined adaptive genetic algorithms (AGA) and sine-cosine particle swarm optimization (SCPSO) for multi-drone disaster rescue, dynamically assigning tasks while generating collision-free 3D paths around collapsed structures. Security frameworks must counter emerging threats: Castrillo et al. [9] established layered protocols including MAVLink encryption, RF signal fingerprinting to detect rogue drones, and GPS spoofing countermeasures deployed in critical infrastructure protection. This dual focus on operational efficiency and electronic security addresses gaps in remote UAV supervision.

2.2 Classical Robotic Methods

Classical robotics architectures, such as the Three-Layer Architecture [7-9], ensure system reliability by structuring tasks into hierarchical layers. The task-planning layer decomposes complex goals into simpler sub-tasks, the behavior-control layer schedules predefined actions, and the execution layer controls hardware interfaces.
These architectures have proven effective in domains like UAV security [9] and physical human-robot interaction [7], where reliability and stability are paramount. However, they typically rely on predefined state machines or fixed task decompositions, which can be limiting when responding to more dynamic user instructions. For example, when users provide instructions like "avoid birds and inspect roof cracks," traditional three-layer systems require manual redesign of state transitions to incorporate new tasks. Skill-based architectures [6, 10] modularize capabilities by encapsulating them into reusable skill modules, making the system more flexible and adaptable. These frameworks allow for easy composition of skills to perform various tasks. For example, Dong et al. [10] developed a marine disturbance controller that can be used to address specific environmental challenges. However, extending these frameworks with new skills often requires retraining or reconfiguring parametric models, which can be resource-intensive. Relative to these frameworks, our planning and execution agents map to the task-planning and execution layers, respectively, with ROS-based control algorithms serving as a supplement to the execution agent. The integration of LLMs improves interaction flexibility and generalization compared to these frameworks.

2.3 LLM-based Human-Robot Interaction Methods

Dialogue, as a natural mode of human communication, contributes to integrating robots into human society in the context of HRI. The emergence of LLMs with advanced natural language understanding capabilities enables them to effectively assist robots in understanding scale differences in various environments. They also facilitate the integration and transmission of tasks from the control side to the execution side (robots).

Human-Robot: Wang et al. [28] directly utilized LLMs to generate action commands, but significant deviations existed. Mai et al. [29] used a large-scale language model as a robotic brain to unify egocentric memory and control. Mower et al. [30] extracted actions from the output of LLMs and executed ROS operations/services. Ishika Singh et al. [14] proposed a code framework for task planning using ChatGPT. The core of these articles is to provide program specifications of available actions and objects in the environment to the LLMs, which then plan all possible actions. Their experiments confirmed the feasibility of this approach. Since the task processes provided by LLMs are not classified, they can only transmit simple, encapsulated low-level control functions to LLMs, inform them of specific scenarios and tasks that need to be executed, and then let LLMs process the task flows and issue commands to the execution end. In contrast, Anis Koubaa et al. chose to integrate LLMs with ROS2. In [31], they explored the method of integrating ChatGPT into ROS2 and successfully designed a ROS2 package that could convert human language instructions into navigation commands for ROS2 robots. This package-based adaptation method can be more universally applied to other robot projects, thus having higher compatibility. However, their method only differs from [14] in terms of the path implementation. The classification of tasks and the execution of complex tasks have not been perfectly addressed.
Human-UAV: Regarding the interaction between UAVs and LLMs, the challenges we face are similar to those discussed in the previous two articles. Javaid et al. [32] have investigated the feasibility of combining UAVs with LLMs. With the rapid development of LLMs, Sai Vemprala et al. [15] utilized ChatGPT to achieve UAV control and path planning. Regarding the issue of UAV mission planning, they also demonstrated simple autonomous obstacle avoidance behaviors controlled by ChatGPT in AirSim simulations. Their work demonstrated the efficiency of using the ChatGPT language model for robot control. Cui et al. [33] employed LLMs to plan UAV missions, utilizing a single natural language input to command multiple UAVs in both synchronous and asynchronous modes. Lykov et al. [34] achieved rapid swarm control of UAVs utilizing GPT. The described method enables intuitive orchestration of swarms of any size to achieve desired geometric shapes. However, since they only used the UAV SDK control functions (i.e., controlled by Python scripts) without integrating the ROS platform, and they only used a single LLM agent to simultaneously handle task planning and execution, using pre-prompts for function learning in LLMs without separating planning and execution, their architecture could not accurately implement the planning and execution of long and complex tasks. Mohamed Lamine Tazir [35] improved the hardware equipment and used ChatGPT to transmit control information to UAVs based on PX4/Gazebo simulations, successfully controlling the UAV's operations. This was a very promising attempt because it combined ROS and ChatGPT in one system, but issues with task classification and long-process planning still persist. If we want LLMs to demonstrate their assistive capabilities in more scenarios, such as obstacle avoidance, path planning, or longer task planning, we need multiple LLM agents and a deeper integration of ChatGPT into ROS nodes.

Table 1 Comparison of Methods: here, we clearly demonstrate the advantages of our method compared to traditional physical and teleoperation HUI, as well as single-agent HUI based on LLMs.

HUI                  Force Reflected [18]  ROS-Integrated Framework [20]  ProgPrompt [14]  ROS-LLM [30, 31]  PromptCraft [15]  Words to Flight [35]  Ours
Remote Control       ✗                     ✓                              ✓                ✓                 ✓                 ✓                     ✓
Personalized Tasks   ✗                     ✗                              ✓                ✓                 ✓                 ✓                     ✓
High-Level Plan      ✗                     ✗                              ✗                ✗                 ✗                 ✗                     ✓
Tools Ability        ✗                     ✗                              ✗                ✗                 ✗                 ✗                     ✓

3 Methods Overview

Fig. 2 Traditional teleoperation control method

The traditional method for HUI with UAVs is depicted in Figure 2. It primarily hinges on users' ability to abstract real-world tasks into coding representations and subsequently transmit these to the UAV via wireless links. Inherently, this operational procedure presents an obstacle, as not all users possess the capability to translate real-world tasks into code. This is precisely why it is necessary to introduce LLMs into this process. To describe UAV-GPT more concisely and clearly, we characterize it as an LLM-based converter USR2ML that takes users' speech/text as input and converts it into an executable machine language vector (MLV):

l = (l_1, l_2, l_3, ..., l_T)    (1)

where l_t is the instruction issued at time step t and T denotes the length of the MLV. For each interaction, the output length should be kept within a reasonable range, as the UAV is required to execute the complete instruction sequence in a single step.
If the output is too long, it may result in interaction failure due to system overload or motion deviation. In particular, when the execution length exceeds a certain threshold, the UAV's trajectory may exhibit significant drift caused by cumulative errors from onboard sensors and flight controllers. Since such drift is closely tied to hardware limitations, which are beyond the scope of this study, we define a reasonable range of execution length through three points to minimize the impact of hardware-induced errors and ensure task reliability. This reasonable range is formally defined as a constraint ensuring UAVs execute complete MLVs without exceeding the system's physical or computational capacity. The constraint is determined by:

• Processor limitations (Tello SDK buffer capacity)
• Real-time communication stability (WiFi packet size thresholds)
• Safety protocols (maximum command chain before failsafe activation)

Experimentally, we set l_min = 3 (e.g., takeoff → hover → land) and l_max = 7. When UAV-GPT outputs an MLV, we detect its length; a minimal check of this rule is sketched at the end of this overview. For outputs with a length exceeding l_max, we consider the system's task execution to have failed. The reason is that the task language received by the UAV should be within a specified range; otherwise, it may fail to execute due to hardware resource constraints. Meanwhile, for tasks with longer processes, one of the design goals of the planner is to split the overall task into multiple sub-segments that can each be executed in a single run. Therefore, if the output length exceeds l_max, it can be regarded as a failure of the planner in task division.

To achieve users' requirements, UAV-GPT first needs to classify the user's task. The application scenarios of UAVs in reality are complex and variable, and identifying the accurate task type significantly enhances the success rate and efficiency of the agent. Next, it needs to solve the problem and output a machine language vector that meets the requirements. To achieve this goal, we propose a dual-agent architecture, as shown in Figure 4. The first stage of UAV-GPT is a planning agent, whose task is to convert user intentions into fixed categories of tasks and reorganize the optimal solutions through a predefined behavior library and discrete task reorganization. The second stage is a machine language execution agent, whose task is to evaluate the task flow and output machine language within a reasonable range by leveraging the mapping between the LLM and the predefined instruction library.
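To make the length rule above concrete, the following is a minimal sketch of the MLV length check, assuming the MLV is represented as a plain Python list of instruction strings; the function name and representation are illustrative rather than part of the released system.

```python
# Illustrative sketch of the MLV length constraint (l_min = 3, l_max = 7).
# The list-of-strings representation and helper name are assumptions for
# demonstration; the actual UAV-GPT output format may differ.

L_MIN = 3   # shortest meaningful command chain, e.g. takeoff -> hover -> land
L_MAX = 7   # longest chain before the run is treated as a planner failure


def check_mlv(mlv: list[str]) -> str:
    """Classify an MLV as 'valid', 'too_short', or 'planner_failure'."""
    if len(mlv) < L_MIN:
        return "too_short"        # not enough steps to form a complete task
    if len(mlv) > L_MAX:
        return "planner_failure"  # planner should have split the task into sub-segments
    return "valid"


if __name__ == "__main__":
    print(check_mlv(["takeoff", "forward 500", "up 500", "land"]))  # -> 'valid'
```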
3.1 Task Classification

In the task classification framework, we adopt a two-dimensional tagging system as shown in Table 2, where the first core dimension is the "Simple-Complex" continuum. This dimension is characterized by the following two sub-dimensions:

• State Space Complexity S_c: it is jointly determined by the number of monitoring points p ∈ C_t and the range of dangerous areas d ∈ A, where C_t is the content of the task and A is the task space in the system prompt (the preset system knowledge of the current task scene), reflecting the amount of perceptual information that needs to be attended to for the task.
• Motion Space Complexity M_c: it is determined by the length l ∈ C_t of the action sequence, reflecting the reasoning depth required for the large language model to solve the problem.

In human-computer interaction scenarios, an increase in the number of task monitoring points p and the range of dangerous areas d means more optional paths and execution conditions, which directly exacerbates the complexity of the task [36]. Meanwhile, cognitive load theory [37] indicates that a larger number of monitoring points and an expanded range of dangerous areas will cause LLMs to process a richer task context C_t and perform more intermediate reasoning steps before executing the task, ultimately increasing the complexity of the task. Next, we define the overall task complexity as a linear combination:

C_task = α · S_c + β · M_c    (2)

where S_c = γ_p · p + γ_d · d is the state space complexity determined by the number of monitoring points p and the range of dangerous areas d, and M_c = γ_a · l is the motion space complexity determined by the action sequence length l required by the task. Figure 3 shows how we determine these values during a task. Note that our tasks are strictly limited to a 50 × 50 × 50 meter volume. This prevents conflicts where simple tasks requiring long-distance execution receive an inaccurately low C_task relative to their actual complexity. Instructions must also be explicit and non-autonomous to ensure that p and d remain computable, even when their values are zero. For example, high-autonomy tasks such as "search for a specific location" are excluded due to the infeasibility of defining p. In contrast, commands such as "take off and move forward 5 meters" remain valid, since setting p and d to zero simply shifts the final complexity C_task to the action sequence length l without compromising classification accuracy.

Fig. 3 The red parts represent dangerous areas, the yellow parts represent action sequences, and the blue parts represent monitoring points.

The coefficients α and β control the balance between perception and reasoning burdens, while γ_p, γ_d, and γ_a serve as unit scaling factors. Based on this score, we define a two-level classification rule:

Task Type = Simple, if C_task ≤ θ;  Complex, if C_task > θ    (3)

where the threshold θ is an empirical or learned parameter that reflects the separation boundary in the "simple-complex" continuum. To determine the coefficients γ_p, γ_d, γ_a, and θ involved in the task complexity Formulation (2), we utilize two expert-annotated datasets with high/low complexity task labels: CLAD [38] and BRMData [39]. By analyzing p, d, and l within these datasets and correlating them with the provided task complexity labels, we estimate the relative weights of each factor. In this work, we empirically set the balance coefficients as α = β = 0.5, assuming equal contributions from state space complexity and motion space complexity for simplicity. A small illustrative sketch of this scoring rule follows.
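The sketch below restates Equations (2) and (3) in code, assuming illustrative placeholder values for γ_p, γ_d, γ_a, and θ; the real coefficients are fitted on CLAD and BRMData as described above.

```python
# Illustrative restatement of Eqs. (2)-(3). The coefficient values below are
# placeholders for demonstration; in the paper they are fitted on the
# CLAD and BRMData annotations.
ALPHA, BETA = 0.5, 0.5                       # balance coefficients, as in the text
GAMMA_P, GAMMA_D, GAMMA_A = 1.0, 1.0, 1.0    # assumed unit scaling factors
THETA = 5.0                                   # assumed separation threshold


def task_complexity(p: int, d: float, l: int) -> float:
    """C_task = alpha * S_c + beta * M_c with S_c = gp*p + gd*d and M_c = ga*l."""
    s_c = GAMMA_P * p + GAMMA_D * d    # state space complexity
    m_c = GAMMA_A * l                  # motion space complexity
    return ALPHA * s_c + BETA * m_c


def first_dimension(p: int, d: float, l: int) -> str:
    """First classification dimension: Simple vs. Complex."""
    return "Simple" if task_complexity(p, d, l) <= THETA else "Complex"


if __name__ == "__main__":
    # "take off and move forward 5 meters": no monitoring points, no danger zones
    print(first_dimension(p=0, d=0.0, l=2))   # -> 'Simple' under the assumed coefficients
```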
For the second dimension of Table 2, "Independent vs. Tool-assisted", we adopt a knowledge-based decision mechanism. Specifically, the system incorporates two types of prior knowledge:

• Internet Knowledge refers to general, preloaded knowledge that forms the foundational information base of the robot;
• System Knowledge refers to task-specific knowledge dynamically generated or loaded for the current interaction, including contextual and environmental information relevant to the scenario.

During execution, we extract task-related keywords K from the input instruction C_t:

K = {k_1, k_2, ..., k_n}, where k_i ∈ Keywords(C_t),

and match them against the keywords present in system knowledge and Internet knowledge: K_total = K_sys ∪ K_net. If the extracted keywords can be fully matched with this knowledge, the task is classified as Independent, as it can be handled with existing system information:

∀ k ∈ K, k ∈ K_total ⇒ Independent.

Otherwise, if the keywords require additional unknown tools or capabilities beyond the knowledge bases, the task is labeled as Tool-Assisted, implying dependence on external modules or resources:

∃ k ∈ K, k ∉ K_total ⇒ Tool-Assisted.

A minimal sketch of this matching rule, combined with the first dimension, is given at the end of this subsection. This classification approach allows the system to dynamically determine task autonomy based on the semantic alignment between user instructions and available knowledge, thereby bridging high-level language understanding with execution-level resource requirements. The generalizability of the proposed framework is currently bounded by the calibration dataset used during the coefficient learning phase. Specifically, the classification of UAV interaction tasks relies on the coefficient ranges derived from this dataset. Therefore, the broader and more diverse the calibration data, the wider the range of tasks the system can effectively support. This design ensures a balance between interpretability and scalability, allowing the framework to gradually extend its capability as more representative task data becomes available. In our implementation, we have made efforts to enrich the calibration dataset and ensure that the resulting classification coefficients are both reasonable and robust. By combining these two dimensions, we derive four distinct task types: SI, ST, CI, CT.

Table 2 Classification Tags

                 Simple                                              Complex
Independent      Move forward 5 meters and take a picture            Move forward 5 meters, then take pictures of the kitchen and two bedrooms
Tool-assisted    Move forward 5 meters and avoid obstacles in time   Move forward 5 meters, then take pictures of the kitchen and two bedrooms, and avoid obstacles in time

The planning agent of the UAV-GPT framework is responsible for comprehending user intent and accurately classifying tasks along the first dimension. The execution agent, meanwhile, needs to decide whether the task is independent and choose the right method to perform it through the two-dimensional classification, translating these classified tasks and plans into machine language for ultimate execution. In designing the framework, we leverage three key capabilities of LLMs: firstly, converting real-world task descriptions into code tasks; secondly, identifying task types; thirdly, assessing tasks and selecting optimal implementation approaches through the LLM's reasoning abilities. To evaluate the performance of the two agents within the UAV-GPT framework, we employ three performance metrics: Intent Recognition Accuracy (IRA), Task Execution Success Rate (ESR), and UAV Energy Consumption (UEC) for quantitative testing.
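As a companion to the complexity sketch above, the following minimal sketch shows the keyword-matching rule for the second dimension and how the two dimensions combine into the four tags SI, ST, CI, and CT; the keyword extraction step is stubbed out and the knowledge sets are toy examples, not the actual knowledge bases.

```python
# Illustrative sketch of the second dimension and the combined 2-D tag.
# Keyword extraction is performed by the LLM in the real system; here it is
# assumed to be done already, and the knowledge sets are toy examples.

K_SYS = {"move", "forward", "picture", "kitchen", "bedroom", "takeoff", "land"}
K_NET = {"meters", "camera"}
K_TOTAL = K_SYS | K_NET


def second_dimension(keywords: set[str]) -> str:
    """Independent if every keyword is covered by K_total, otherwise Tool-Assisted."""
    return "Independent" if keywords <= K_TOTAL else "Tool-Assisted"


def classify(first_dim: str, keywords: set[str]) -> str:
    """Combine the two dimensions into SI / ST / CI / CT."""
    second = second_dimension(keywords)
    return ("S" if first_dim == "Simple" else "C") + ("I" if second == "Independent" else "T")


if __name__ == "__main__":
    # "Move forward 5 meters and avoid obstacles in time": 'obstacles' is not in
    # the toy knowledge sets, so the task needs an external tool -> ST.
    print(classify("Simple", {"move", "forward", "meters", "obstacles"}))  # -> 'ST'
```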
Fig. 4 The dual-agent architecture that converts users' requests into a machine language vector. [Figure: user tasks and system input flow into the LLM-based task classification and planning agent, which draws on system knowledge and Internet knowledge to output a category and plan; the LLM-based task execution agent, supported by tool knowledge, converts this plan into the machine language vector. Example user tasks shown include "Make the drone take off, move forward 5 meters, ascend 5 meters, and then land", "Have the drone fly in a circular trajectory for me", and "Go to the location of the cola, take a picture for me, and be aware of the obstacles".]

3.2 The Design of the Planning Agent

The core function of the LLM-based task planning agent is to receive user voice or text instructions and perform complexity classification and planning accordingly (Figure 4). To support this process, a two-dimensional task categorization system is introduced into the LLM, along with clearly defined classification rules. This enables the agent to interpret diverse natural language expressions of the same task and map them to appropriate structured plans. Specifically, the four task types include: simple independent tasks (e.g., "move forward 10 meters and then ascend"), simple tool-assisted tasks (e.g., "move forward 10 meters while avoiding obstacles in real time"), complex independent tasks (e.g., "go to the kitchen, bathroom, and living room to take photos and return"), and complex tool-assisted tasks (e.g., "go to the kitchen, bathroom, and living room to take photos and return, avoiding obstacles in real time during the journey").

To enhance the recognition capability of the interaction framework, we have designed a specific guidance path for the planning agent. This path requires the LLM to extract key information from user inputs: Scene Keywords and Task Actions. The LLM infers the possible task planning scenarios based on Scene Keywords and differentiates execution methods through Task Actions. These two pieces of information correspond to the first and second dimensions of task classification, respectively. This guidance scheme enables the LLM to classify the two indicators separately, significantly improving classification accuracy. Regarding task planning, there are notable differences between the planning and execution agents.
The planning agent addresses the one-dimensional range issue in task classification, employing distinct planning approaches for tasks of varying complexities. For simple tasks, the LLM describes the action sequence the UAV needs to perform in precise natural language. For complex tasks, the LLM utilizes tools such as pre-trained neural networks or fine-tuned information to plan for optimal success rates and low energy consumption. In summary, the design of the planning agent encompasses selecting an LLM model (in this case, Qianwen), utilizing prompts to design the guidance structure, and creating a carefully crafted 2D task classification list with well-structured context. This setup enables the planning agent to effectively accomplish classification and planning tasks within the UAV-GPT framework. To evaluate the performance of the planning agent in terms of IRA and UEC, we have constructed a UAV-GPT database for experimental testing.

3.3 The Design of the Execution Agent

A common approach to tackling task execution for the UAV is to rely solely on an LLM agent, which utilizes LLMs such as Qianwen or Llama3 to convert naturally described task workflows into machine language (code) comprehensible by the UAV, subsequently transmitting it via remote communication facilities. For instance, the CaP framework [40] adopts this method. CaP, a robot-centric language model approach, generates programmatic representations capable of yielding reactive policies (like impedance controllers) and waypoint-based strategies (e.g., vision-based grasping and placing, trajectory-based control). The core of the CaP approach lies in linking classical logical structures, incorporating third-party libraries (e.g., NumPy, Shapely) for arithmetic operations, and allowing LLMs to receive commands and autonomously recombine API calls to produce novel strategy code. While promising, this method is constrained by its execution format, generating script-based strategies for robots to execute, which limits real-time interaction with robots during task execution and is thus insufficient for tasks like real-time obstacle avoidance. Therefore, building upon our execution agent architecture, we preserved the LLM's capability to translate natural language into machine language while introducing tool invocation abilities. As illustrated in Algorithm 1, after the user provides an instruction, the system first computes the task complexity based on the monitoring points p, the danger range d, and the input length l. The planning agent then determines which task type it is and generates a corresponding action plan. The execution agent is equipped with a neural-network-based RGB image recognition framework integrated with EgoPlanner for monocular real-time obstacle avoidance (i.e., path planning functionality), shown in Figure 5, and it decides how to execute the task based on the plan and task type from the planning agent. We encapsulate the API-based LLM invocation behavior into a lightweight ROS package. Within the ROS system, the llm node publishes instruction information from the API, which is subscribed to by the EgoPlanner algorithm to determine whether to initiate the path planning process; this framework enables monocular real-time obstacle avoidance (i.e., path planning functionality).
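To illustrate the lightweight ROS package described above, here is a minimal rospy sketch of an LLM instruction publisher; the topic name, message type, and get_llm_instruction() helper are assumptions for demonstration, and the real package's interfaces (and the EgoPlanner subscriber side) are not shown.

```python
#!/usr/bin/env python
# Minimal sketch of an "llm node" that publishes LLM-generated instructions on a
# ROS topic, to be consumed by a planner node (e.g., EgoPlanner). The topic name
# "/llm/instruction", the String message type, and get_llm_instruction() are
# illustrative assumptions, not the released package's actual interfaces.
import rospy
from std_msgs.msg import String


def get_llm_instruction() -> str:
    """Placeholder for the API-based LLM call that returns one machine-language instruction."""
    return "goto kitchen; enable_obstacle_avoidance"


def main():
    rospy.init_node("llm_node")
    pub = rospy.Publisher("/llm/instruction", String, queue_size=10)
    rate = rospy.Rate(1)  # publish at 1 Hz for illustration
    while not rospy.is_shutdown():
        pub.publish(String(data=get_llm_instruction()))
        rate.sleep()


if __name__ == "__main__":
    main()
```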
In the LLM prompt for the execution agent, we instruct the LLM to employ pure agent functionality for independent tasks, converting naturally described tasks into machine language (code), and to invoke predefined tools through ROS for tool-assisted tasks (Figure 4). To evaluate the performance of the execution agent in terms of Task Execution Success Rate (ESR) and UAV Energy Consumption (UEC), we conduct experimental tests using the constructed UAV-GPT database.

Algorithm 1 Main loop of UAV-GPT, where p is the number of monitoring points, d is the range of dangerous areas, and l is the length of the action sequence in the user's input
1: Initialize Planning Agent
2: Load AirSim basic and system prompt
3: while true do
4:   input ← Get {User & Pre} Input
5:   if input = !quit or input = !exit then
6:     ExitProgram
7:   else if input = !clear then
8:     ClearScreen
9:   else
10:    (p, d, l) = LLM_plan(input)
11:    Compute complexity C_task based on Eq. (2)
12:    if C_task > θ then
13:      taskType1 = Complex
14:    else
15:      taskType1 = Simple
16:    end if
17:    Plan = LLM_plan(C_t, taskType1)
18:    OutputPlan(Plan)
19:  end if
20: end while
21:
22: Initialize Execution Agent
23: Load tool basic and system prompt
24: while waiting for taskType1 and Plan do
25:   input1 ← System Knowledge
26:   input2 ← Internet Knowledge
27:   keywords = LLM_execute(C_t)
28:   Pre_K = input1 + input2
29:   if keywords ⊆ Pre_K then
30:     taskType2 = Independent
31:   else
32:     taskType2 = Tool-Based
33:   end if
34:   if taskType2 = Independent then
35:     action = LLM_execute(Plan)
36:     ExecutePlan(LLM)
37:   else
38:     action = LLM_execute(Plan)
39:     ExecutePlan(LLM and Tool)
40:   end if
41: end while

For example, consider the task "Move forward 5 meters, then take pictures of the kitchen and two bedrooms, and avoid obstacles in time", and assume the system prompt indicates that there are obstacles at the doorways of the kitchen and the bedrooms. The execution process of the system is as follows:

1. Knowledge Matching & Task Classification
• Keyword Extraction:
  – Actions: move forward, take pictures, avoid obstacles
  – Targets: kitchen, bedrooms
• Knowledge Base Comparison:
  – System Knowledge: predefined navigation parameters (e.g., EgoPlanner node), camera APIs.
  – Internet Knowledge: UAV control APIs.
• Decision Mechanism:
  – Tool-Assisted Task: avoiding obstacles requires real-time obstacle data → triggers the external collision-avoidance module.
2. Executable File Generation
• execute_egoplanner(target="kitchen", obstacle_data=env_obstacles)
• activate_camera(camera_type="RGB")
A minimal code sketch of this dispatch step is shown after Figure 5.

Fig. 5 The UAV uses ROS with multiple algorithms to perform tasks: capturing an RGB image, processing it for depth estimation and 3D mapping, planning a path, and adjusting flight with a controller.
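Following the example above, the sketch below mirrors the dispatch step of the execution agent (cf. Algorithm 1, lines 34-40): depending on the second-dimension tag, it either returns plain machine-language commands or wraps the plan in tool invocations. The llm_to_code() and call_tool() stubs and the command strings are illustrative assumptions, not the framework's fixed output format.

```python
# Illustrative dispatch step of the execution agent for the worked example.
# llm_to_code() and call_tool() stand in for the LLM translation call and the
# ROS tool interface; the emitted command strings are examples only.

def llm_to_code(plan: str) -> list[str]:
    """Stub: the LLM turns a natural-language plan into machine-language commands."""
    return ["forward 500", "takepicture"]


def call_tool(plan: str) -> list[str]:
    """Stub: wrap the plan into tool invocations, as in the generated example files."""
    return [
        'execute_egoplanner(target="kitchen", obstacle_data=env_obstacles)',
        'activate_camera(camera_type="RGB")',
    ]


def dispatch(plan: str, task_type2: str) -> list[str]:
    # Independent tasks: pure agent functionality (direct code generation).
    # Tool-assisted tasks: the same plan, but executed through predefined ROS tools.
    return llm_to_code(plan) if task_type2 == "Independent" else call_tool(plan)


if __name__ == "__main__":
    plan = "move forward 5 m, photograph kitchen and two bedrooms, avoid obstacles"
    print(dispatch(plan, task_type2="Tool-Based"))
```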
4 Experiment

This section presents the experimental results of the UAV-GPT framework, aiming to validate its practicality through the IRA, ESR, and UEC metrics. We set up experimental scenarios in both simulation and real-world environments. The simulation scenario is built using the AirSim simulator and the UE4 engine, while the real-world scenario is established with a laptop equipped with an RTX 3080 and a Tello UAV.

4.1 Dataset and Evaluation Metrics

To evaluate the performance of our framework on complex tasks, we generated various requests that combine simple tasks (Figure 6), encompassing the multiple decision types related to UAV control behaviors in Table 2. Specifically, our dataset includes different experimental backgrounds such as moving forward, taking photos, searching for objects, indoor navigation, outdoor inspection, etc., as well as specific experimental tasks based on these backgrounds. Additionally, these tasks are expressed both explicitly and implicitly. In explicit expressions, tasks clearly require the UAV to go somewhere and perform certain tasks, such as the task types proposed in Task List 1. Implicit expressions, however, do not explicitly state the purpose but use alternative phrasing to let the LLM infer the specific details of the task, for example, "I want to eat watermelon, but I don't know where it is." The dataset contains a total of 160 task types for calculating C_task, with each two-dimensional label containing 40 tasks. By comparing the task classification labels output by the planning end with the labels of the tasks themselves, IRA can serve as an effective measure of accuracy.

Fig. 6 Datasets. The tasks here are only symbolic; in reality, the task classification is determined by the empirical parameters derived from the expert-annotated datasets.

Furthermore, each task in the dataset has corresponding completion criteria, which are compared with the output of UAV-GPT's solutions to determine whether the agent has successfully addressed the user's needs. IRA measures the proportion of tasks accurately classified by the planning agent within a hypothetical task database, with human classification accuracy set as the 100% benchmark. ESR reflects the ratio of successful task executions to total executions under the same task classification and plan inputs. Finally, UEC indirectly indicates power consumption by measuring the flight duration of the UAV during various task executions; we compare the UEC levels under the same task to indicate energy-saving performance. Since the flight speed of the UAV is fixed in this scenario, we use flight time to estimate the UAV's energy consumption, enabling more precise data.

4.2 Prompt Engineering

We briefly present the different pre-prompt structures in Figure 7.

General Prompt (RP): In general prompts, the LLM is informed of the role it needs to play, the tasks it needs to perform, and the necessary location information for executing those tasks. However, it is not provided with any background knowledge related to the tasks or additional information about the current scenario. It must plan and act based on very limited cognition.

Contextual Prompt (CP): Building on general prompts, contextual prompts provide the LLM with multiple task templates and corresponding outcomes for learning. Additionally, it is given extra information related to the input task, such as the coordinates of all objects in the scenario and the locations of potential obstacles. This information greatly enhances the LLM's ability to understand the scenario. We hope that through this approach, the LLM can accurately understand the user's needs and generate solutions.

Iterative Prompt for Errors (EIP): Building on the first two types of prompts, we simplify the failed solutions output by the LLM and use them as prompt texts in the pre-prompting stage. This can further refine the system prompts and improve performance. By standardizing the structure and content of the prompts, this method can increase the number of task templates and outcomes (negative templates) in the prompts, allowing the LLM to anticipate more potential task scenario solutions and ensure more accurate and reliable classification and solution outputs.
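To make the three prompt levels more tangible, here is a small illustrative skeleton of how an EIP-style system prompt could be assembled in code; the role text, scene context, templates, and negative examples are assumptions for demonstration and do not reproduce the actual prompts shown in Figure 7.

```python
# Illustrative assembly of an EIP-style system prompt. All field contents below
# are placeholders, not the exact prompts used in the paper (see Figure 7).
ROLE = "You are a UAV task planner. Classify the task and output a step-by-step plan."
SCENE_CONTEXT = "Known objects: kitchen (3, 4, 0), bedroom A (7, 1, 0). Obstacles: doorway of kitchen."
TEMPLATES = [
    ("Move forward 5 meters and take a picture", "SI: forward 500; take_photo"),
]
NEGATIVE_TEMPLATES = [
    ("Avoid obstacles while inspecting the roof",
     "Do NOT output raw velocity code in the plan; mark the task as tool-assisted instead."),
]


def build_eip_prompt() -> str:
    parts = [ROLE, SCENE_CONTEXT]                                   # RP + CP content
    parts += [f"Example: {q} -> {a}" for q, a in TEMPLATES]          # CP task templates
    parts += [f"Past failure: {q} -> {fix}" for q, fix in NEGATIVE_TEMPLATES]  # EIP additions
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_eip_prompt())
```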
Fig. 7 Prompts

4.3 Experimental Results

We used ERNIE-4.0 and GPT-4o as the base models and set the randomness of the models to 0 to eliminate possible random responses and ensure consistent classification of requests by the LLM. At the same time, we used GPT-3 and Llama3 70B as controls to enhance the persuasiveness of the experiments and verify the versatility of the framework. Firstly, Figure 8 clearly reveals the differences in task planning and execution strategies between our framework and the traditional LLM-based HUI framework when executing tool-assisted tasks. Due to the incorporation of a ROS-based obstacle avoidance tool, UAV-GPT plans a smoother route (b) when performing such tasks compared to the traditional framework. Moreover, because the dual-agent design prevents LLMs from confusing planning with execution output, which may lead to irrational planning, UAV-GPT devises a more reasonable and time-saving route when confronted with complex tasks (c). Therefore, we can qualitatively conclude that UAV-GPT exhibits better performance than the traditional LLM framework when dealing with tool-assisted and complex tasks. We repeated the experiments of Figure 9 using the Tello UAV in real-world scenarios to further validate the reliability of our conclusions in Figure 12. Next, in order to comprehensively evaluate the performance of UAV-GPT, we adopt three indicators, IRA, ESR, and UEC, as the testing standards to quantitatively assess the performance of our framework.

Fig. 8 The single-agent planning framework (red box) and the UAV-GPT method (green box) exhibit significant differences in strategies when executing ST and CI tasks. UAV-GPT is more reasonable in execution.

Fig. 9 The same tasks completed in real-world scenarios using the Tello drone.

4.3.1 Intent Recognition Accuracy (IRA)

Figure 10 demonstrates the influence of employing various types of prefix prompts on the success rate of task classification by UAV-GPT when confronted with tasks of different classification labels. Here, we uniformly use ERNIE-4.0 for testing to avoid the influence of different foundational LLMs on the success rate of prefix prompts in classification, as it provides a good balance between latency and accuracy and is therefore well suited for our experiments. The findings reveal that EIP achieves the highest IRA among the three distinct types of prefix prompts. Specifically, in the CI and CT tasks, compared to RP, the success rate of EIP prompts increases by an average of 25%. Compared to CP, in the CT task there is a 23% improvement, which stems from the incorporation of environmental information and past misclassifications. Consequently, in subsequent experiments, we adopt EIP prompts as the prefix prompt type to minimize the influence of prefix prompts on the experimental outcomes.

Fig. 10 The performance comparison of IRA with different prompts.
4.3.2 Task Execution Success Rate (ESR)

Figure 11 shows the differences in the ESR metric across three interaction frameworks when faced with tasks of different classification labels, under the EIP setting. The results show that for simple independent tasks and simple tool tasks, there is no significant difference in success rate between the three methods; however, when faced with complex tasks, whether complex tool-assisted or complex independent, the introduction of the dual-agent planner significantly increases the success rate of UAV-GPT compared to the other two methods (PromptCraft [15] and CaP [40], representing single-agent LLM SOTA baselines). Additionally, 'Ours' employs ERNIE-4.0 as the base LLM model. In the case of CI task classification, the efficiency of UAV-GPT improves by 34% compared to the other two methods, and by 61% for CT tasks. We attribute this result to the improvement in task classification efficiency and the addition of the planning end. To further explore the reasons, we conducted additional experiments on UAV-GPT, integrating the planning end and execution end into a single planner and then performing task classification and execution. The results in Figure 11 confirm our hypothesis: for simple-category tasks, there is no significant difference in success rate between single-agent and dual-agent planning; however, for complex-category tasks, the advantage of dual-agent planning is evident. We also designed an experiment to verify whether the involvement of external tools can improve the success rate of task execution. We constructed a dataset consisting of 20 independent "simple-tool" tasks and used UAV-GPT to execute these tasks under two different settings: one where the system can normally call external tools, keeping the task flow consistent with regular operation; the other where external tool calls are artificially prohibited, meaning that even if the system has identified the task as "simple-tool", it cannot actually schedule the relevant modules for execution. Through this comparative experiment, we aim to further verify the key role of external tools in task execution. The results in Figure 12 clearly show that the system's execution capability is significantly affected in the absence of external tool support, thereby highlighting the importance of tool involvement for specific task types.

Fig. 11 The performance comparison of ESR with different HUI frameworks.

Fig. 12 The difference between an execution agent using external tools and one not using external tools.

4.3.3 UAV Energy Consumption (UEC)

Although UAV-GPT demonstrates a high completion rate across tasks with the four classification labels, the efficiency of UAV task execution is also a crucial indicator that cannot be overlooked in practical applications. If a UAV can accomplish a task but takes an excessively long time, it is clearly unreasonable. Therefore, we next selected ST and CI category tasks for testing (we picked four ST and CI tasks from the first two indicator tests that all three interaction frameworks could complete) and let the three frameworks output solutions and execute them respectively. Figure 13 shows that for simple tool-based and complex independent category tasks, our method exhibits a significant advantage over the other two pure-scripting and single-planning-end methods.
4.3.3 UAV Energy Consumption (UEC)

Although UAV-GPT demonstrates a high completion rate across tasks with all four classification labels, the efficiency of task execution is also a crucial indicator that cannot be overlooked in practical applications: if a UAV can accomplish a task but takes an excessively long time, it is clearly unreasonable. We therefore selected ST and CI tasks for testing (four ST and CI tasks from the first two indicator tests that all three interaction frameworks could complete) and let the three frameworks output solutions and execute them respectively. Figure 13 shows that, for simple tool-based and complex independent tasks, our method exhibits a significant advantage over the other two methods (pure scripting and single planning end).

To control variables and avoid confounding factors, we still used the EIP prompt, the best performer under the IRA metric, and fixed ERNIE-4.0 as the LLM base for all tested methods. The PX4 Autopilot was also consistently used as the simulated UAV controller for the three tested methods to minimize system-level interference.

In real-world scenarios, the power consumption of a drone can be calculated from the battery capacity before and after the flight. However, there is no physical battery model in the simulation environment, so we analyze the "Actuator Outputs (Main)" data from the simulated flight controller to estimate the power consumption of the drone during task execution. As shown in the top plot of Figure 14, each curve corresponds to the PWM control signal of one main motor. Since the simulation does not model motor losses, the PWM value is taken to represent the motor speed. According to the empirical physical relationship, under constant air resistance and load the propeller power $P$ and the motor speed $n$ satisfy a cubic relationship:

$P \propto n^{3}$  (4)

Thus, we obtain the total power:

$P_{\mathrm{total}}(t) = \sum_{i=1}^{4} P_{i}(t)$  (5)

Finally, we obtain the total energy consumed during one task:

$E_{\mathrm{total}} \approx \int P_{\mathrm{total}}(t)\, dt$  (6)

To evaluate the UAV's energy consumption during task execution more intuitively, we introduced the Speed-to-Power Ratio (SPR) as a visualization indicator on top of the quantitatively calculated UEC. This indicator is derived from the average velocity vector and the quantized power consumption within each 2-second interval of the task scenario, and it clearly reflects whether the UAV's power is being used to increase speed rather than to generate additional losses. The SPR is calculated as:

$\mathrm{SPR}_{k} = \dfrac{\lVert \mathbf{v}_{k} \rVert}{\sum_{t=t_{k}}^{t_{k}+2\,\mathrm{s}} P_{\mathrm{total}}(t)\, \delta t}$  (7)

where $k$ denotes the $k$-th 2-second time interval, $\mathbf{v}_{k}$ is the average velocity vector within this interval, and $P_{\mathrm{total}}(t)$ is the PWM-estimated power during this period.

Compared with PromptCraft and CaP, our method achieved an average advantage of 70 W per task in ST tasks and 55 W per task in CI tasks. This advantage stems from the integration of tool invocation and the planning end, enabling Large Language Models (LLMs) to execute obstacle avoidance in a more direct manner and to plan paths from a global perspective without considering the details of specific execution code.

Fig. 13 Performance comparison of UEC across different tasks and HUI frameworks.

Fig. 14 In the chart, red areas represent low-efficiency flight (low SPR values) and green areas represent high-efficiency flight (high SPR values); different color depths reflect efficiency differences within the same category: the darker the red, the lower the flight efficiency.
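As a numerical illustration of Eqs. (4) to (7), the sketch below estimates per-motor power from logged PWM values, accumulates task energy, and computes SPR over 2-second windows. The sampling period, proportionality constant, and array layout are assumptions, so the resulting values are only meaningful for comparing runs, not as calibrated physical watts.

```python
# Sketch of the UEC/SPR computation described by Eqs. (4)-(7).
# Column layout, sampling period dt, and the constant k are assumptions.
import numpy as np

def task_energy_and_spr(pwm, vel, dt=0.1, k=1.0, window_s=2.0):
    """pwm: (T, 4) main-motor PWM samples; vel: (T, 3) velocity vectors [m/s];
    dt: sampling period [s]; k: assumed power proportionality constant."""
    p_motor = k * pwm.astype(float) ** 3          # Eq. (4): P_i proportional to n_i^3
    p_total = p_motor.sum(axis=1)                 # Eq. (5): sum over the 4 main motors
    e_total = float(p_total.sum() * dt)           # Eq. (6): discrete energy integral

    spr = []
    step = int(window_s / dt)                     # samples per 2-second window
    for start in range(0, len(p_total) - step + 1, step):
        v_avg = vel[start:start + step].mean(axis=0)
        e_win = p_total[start:start + step].sum() * dt
        spr.append(np.linalg.norm(v_avg) / e_win)  # Eq. (7)
    return e_total, np.array(spr)

# Usage with synthetic logs: 60 s of flight sampled at 10 Hz.
rng = np.random.default_rng(0)
pwm = 1500 + 100 * rng.random((600, 4))
vel = rng.normal(0, 1, (600, 3))
energy, spr = task_energy_and_spr(pwm, vel)
```

The per-window SPR values are exactly what the green/red path colouring below would be driven by: windows above a chosen SPR threshold are rendered green, windows below it red.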
To further analyze the efficiency changes during flight, we simultaneously monitored the effective operating range of the motor PWM and observed its fluctuation characteristics over the time series. In Figure 14, the green area represents stable PWM output and balanced speed, indicating that the aircraft is in a stable propulsion state with high operating efficiency; the red area shows sharp changes and high-frequency disturbances in the PWM signal, accompanied by speed jumps, which means the system is frequently adjusting the motors to maintain attitude or counter external disturbances, leading to reduced efficiency and increased energy consumption. Based on the SPR values calculated from the vector velocity and quantized power consumption, we visualized the flight path in green (high efficiency) and red (low efficiency). We designed an indoor space with three bedrooms, two living rooms, one kitchen, and two bathrooms (Figure 15). We then used two ST tasks to demonstrate the visualization of flight efficiency represented by SPR values, as shown in Figures 16 and 17.

Fig. 15 Simulated indoor environment

Fig. 16 Simple tool task 1

Fig. 17 Simple tool task 2

The quantitative and visualized data show that our agent system achieves a significant improvement in the execution efficiency of ST and CI tasks. First, in terms of the planning and execution of CI tasks, the target sequence of high-level task planning is generated by large models. Since this paper compares a variety of different large-model agents, and to keep the comparison fair and in line with the research direction of this paper, we did not perform special tuning for any single large model. The quantitative data show that, compared with the traditional single-end agent framework, the dual-end agent structure we adopt (which separates top-level planning from bottom-level execution) shows clear advantages in planning task objectives. Second, Figures 16 and 17 clearly show that in the ST task, compared with execution by a single-agent large model alone, the waypoints released by our framework during execution have better power efficiency. This smoother waypoint release is precisely due to the support of the dedicated tools we use.

We also evaluated the impact of different LLMs on the quality of work produced by this framework (Table 3). The models we selected are ERNIE-4.0, GPT-4o, GPT-3, and Llama3 70B. Using the EIP prompt mode, we extracted 10 tasks from each of the four task categories in the dataset, 40 tasks in total, for classification testing. Table 3 shows the impact of the different LLMs on classification success rates: ERNIE-4.0 and GPT-4o performed the best, followed by Llama3 70B, with GPT-3 performing the worst.

Table 3 The impact of different LLMs on the quality of work

                ERNIE-4.0   GPT-4o   Llama3 70B   GPT-3
Accuracy rate   100%        97%      96%          92%
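A harness of roughly the following shape would produce the Table 3 numbers; the classify() callable, function names, and the balanced 10-per-category check are our assumptions standing in for each model's chat API under the EIP prompt.

```python
# Illustrative sketch (not the authors' evaluation harness) of the Table 3
# measurement: the same 40 labelled tasks are classified by each candidate
# LLM and accuracy is computed per model.
from collections import Counter

LABELS = ("SI", "ST", "CI", "CT")

def accuracy(model_name, tasks, classify):
    """tasks: list of (request, gold_label); classify(model, request) -> label."""
    hits = sum(classify(model_name, req) == gold for req, gold in tasks)
    return hits / len(tasks)

def run_benchmark(tasks, classify, models=("ERNIE-4.0", "GPT-4o", "Llama3 70B", "GPT-3")):
    # The dataset described above holds 10 tasks per category (40 in total).
    assert Counter(g for _, g in tasks) == Counter({label: 10 for label in LABELS})
    return {m: accuracy(m, tasks, classify) for m in models}
```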
4.4 User Study

To evaluate the usability and user experience of the UAV-GPT framework in real HUI, we designed and conducted a user survey, inviting 60 participants with professional backgrounds in drone or robot applications. We first aim to evaluate the user experience of the LLM-based UAV-GPT against a traditional ground station (GS), and second to compare our dual-agent framework with existing single-agent LLM frameworks (CaP and PromptCraft). To achieve this, we evaluated these two aspects separately, which allows us to analyze the specific improvements of our dual-agent architecture over single-agent methods without the results being confounded by the fundamentally different manual control paradigm.

We first constructed a user-study dataset by selecting five representative tasks from each of the four task classes in Section 3.1 (SI, ST, CI, CT), covering scenarios such as target inspection, area patrol, and multi-step task planning, resulting in 20 tasks in total. For each chosen task, we generated four execution clips, each showing the task completed by (1) GS, (2) UAV-GPT, (3) CaP, and (4) PromptCraft. Participants were divided into two groups: Group 1 (n=20) evaluated Q1, selecting their preferred method between UAV-GPT and GS; Group 2 (n=40) evaluated the three LLM-based frameworks with respect to Q2: overall preference, Q3: which is easier to use, Q4: which is more adaptable to diverse environments, and Q5: which they would prefer in their professional domain. In both groups, participants viewed anonymized execution clips for 5 randomly chosen tasks from the user-study dataset.

Table 4 Preference between UAV-GPT and GS in Group 1 (Q1).

            CT   CI   ST   SI
GS          13   12    9   11
UAV-GPT     11   14   16   14

Table 5 User preferences for the LLM-based frameworks in Group 2 (counts of choices).

Method        CT   CI   ST   SI   Q3    Q4    Q5
UAV-GPT       24   23   27   21   120   130   140
CaP           16   23   10   16    20    40    30
PromptCraft    7    5   12   16    60    30    30

Next, we analyze the response data of the user study. As shown in Table 4, participants in Group 1 preferred UAV-GPT over the GS across three task categories, demonstrating an advantage in overall user experience. However, the system's reliance on LLM APIs introduces communication and inference latencies compared with direct manual control. This delay is particularly pronounced in complex tasks involving long action sequences, which partially attenuated the user preference for UAV-GPT in such scenarios. Furthermore, as indicated in Table 5, our method consistently outperformed CaP and PromptCraft across all four task categories in terms of Q2, Q3, Q4, and Q5. The advantage narrowed slightly only in the lower-difficulty SI scenarios, where the selection proportions among the three methods were relatively closer.

To statistically validate the user preferences observed for Q2 (overall preference among the LLM-based frameworks), we analyzed the raw data from the 40 participants in Group 2. Each participant completed 5 trials, yielding 200 total observations. For the analysis, we quantified user preference by computing the selection proportion of each method (UAV-GPT, CaP, PromptCraft) for every participant in Q2. These proportions served as the dependent variable in a one-way repeated-measures ANOVA with method as the within-subject factor. The analysis revealed a significant main effect of framework method on user preference (F(2, 78) = 13.12, p < 0.001). Post-hoc pairwise comparisons indicated that UAV-GPT was chosen significantly more frequently than PromptCraft (p < 0.001) and showed a marginal preference advantage over CaP (p ≈ 0.05). Additionally, CaP was selected more often than PromptCraft (p < 0.05).
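A minimal sketch of this repeated-measures analysis, assuming each Group 2 participant's five Q2 choices are available as raw strings and using AnovaRM from statsmodels (the column names are our own), is as follows:

```python
# Hedged sketch of the Q2 analysis: per-participant selection proportions are
# computed for each method and fed to a one-way repeated-measures ANOVA with
# method as the within-subject factor. Column names are illustrative.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

METHODS = ["UAV-GPT", "CaP", "PromptCraft"]

def q2_anova(choices):
    """choices: list of 40 lists, each holding the 5 methods a participant picked."""
    rows = []
    for pid, picks in enumerate(choices):
        for m in METHODS:
            rows.append({"participant": pid,
                         "method": m,
                         "proportion": picks.count(m) / len(picks)})
    df = pd.DataFrame(rows)
    # Post-hoc paired comparisons could then be run, e.g. with scipy.stats.ttest_rel.
    return AnovaRM(df, depvar="proportion", subject="participant",
                   within=["method"]).fit()

# Example: print(q2_anova(raw_choices))  # reports the F(2, 78) statistic and p-value
```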
These statistical findings corroborate the descriptive distributions shown in Table 5, confirming that participants overall preferred the UAV-GPT framework over the other two methods. To provide a more intuitive presentation of the experimental results, we converted Table 5 into pie charts, as shown in Figure 18.

Fig. 18 User study results (pie charts for the CT, CI, ST, and SI task categories and for Q3, Q4, and Q5). The study clearly demonstrates that our method is more favored by users than the single-agent approaches without external tools.

5 Discussion and Limitations

While UAV-GPT demonstrates significant improvements in task success rate (45.5% ESR gain) and energy efficiency (62.5 W UEC reduction per task) compared with single-agent frameworks, several limitations warrant discussion. The parameter tuning process for the task complexity coefficients (α, β, γ_p, γ_d, γ_a) relies on empirical fitting from expert-annotated datasets (CLAD and BRMData). Although effective in the controlled scenarios of Section 4.3, this approach may require manual recalibration when deployed in novel dynamic environments (e.g., agricultural fields with shifting wind patterns), potentially increasing deployment overhead. Regarding task adaptability, our framework achieves 96% success in obstacle avoidance (Figure 11) by integrating real-time replanning tools such as EgoPlanner. However, current validation is confined to indoor, static settings (Figure 8). Extreme outdoor conditions, such as simultaneous multi-obstacle avoidance or strong electromagnetic interference, remain unaddressed. The dependency on ROS-based tools (Section 3.3) further limits applicability to resource-constrained edge devices, a critical gap for field applications such as logistics or environmental monitoring. Future work will prioritize two directions: first, implementing federated learning for autonomous parameter tuning to reduce manual intervention; second, extending validation to heterogeneous outdoor scenarios with lightweight toolchains.

6 Conclusion

This paper proposed the "UAV-GPT" framework, aiming to achieve more natural and flexible HUI with UAVs through LLMs. Traditional UAV interaction designs are limited by predefined task planning, making it difficult to meet users' personalized needs. UAV-GPT achieves direct conversion and execution of user voice or text requests by constructing two LLM agents: a task planning agent and an execution agent. The framework leverages the natural language understanding capabilities of LLMs to accurately classify and plan complex tasks, intelligently selecting execution code or invoking tools to tackle them. Experimental results show that UAV-GPT significantly enhances the fluency of HUI and the flexibility of task execution. Compared with traditional single-ended LLM planning HUI methods, UAV-GPT achieved an average improvement of 24% in Intent Recognition Accuracy, 45.5% in Task Execution Success Rate, and a reduction of 62.5 W in energy consumption per task, effectively meeting users' personalized demands.

References

[1] Höhrová, P., Soviar, J., Sroka, W.: Market analysis of drones for civil use. LOGI - Scientific Journal on Transport and Logistics 14(1), 55-65 (2023)
[2] Flight Standards Department of Civil Aviation Administration of China: Report on Data Statistics of Civil Unmanned Aerial Vehicle Operators and Cloud Systems (2023). http://www.caacnews.com.cn/1/1/202404/W020240426699510487229.pdf

[3] Mirri, S., Prandi, C., Salomoni, P.: Human-drone interaction: state of the art, open issues and challenges. In: Proceedings of the ACM SIGCOMM 2019 Workshop on Mobile AirGround Edge Computing, Systems, Networks, and Applications, pp. 43-48 (2019)

[4] Kassab, M.A., Ahmed, M., Maher, A., Zhang, B.: Real-time human-uav interaction: New dataset and two novel gesture-based interacting systems. IEEE Access 8, 195030-195045 (2020)

[5] Maher, A., Li, C., Hu, H., Zhang, B.: Realtime human-uav interaction using deep learning. In: Biometric Recognition: 12th Chinese Conference, CCBR 2017, Shenzhen, China, October 28-29, 2017, Proceedings 12, pp. 511-519 (2017). Springer

[6] Rajappa, S., Bülthoff, H., Stegagno, P.: Design and implementation of a novel architecture for physical human-uav interaction. The International Journal of Robotics Research 36(5-7), 800-819 (2017)

[7] Haddadin, S., Croft, E.: Physical human-robot interaction. Springer Handbook of Robotics, 1835-1874 (2016)

[8] De Santis, A., Siciliano, B., De Luca, A., Bicchi, A.: An atlas of physical human-robot interaction. Mechanism and Machine Theory 43(3), 253-270 (2008)

[9] Castrillo, V.U., Manco, A., Pascarella, D., Gigante, G.: A review of counter-uas technologies for cooperative defensive teams of drones. Drones 6(3), 65 (2022)

[10] Dong, Z., Tan, F., Yu, M., Xiong, Y., Li, Z.: A bio-inspired sliding mode method for autonomous cooperative formation control of underactuated usvs with ocean environment disturbances. Journal of Marine Science and Engineering 12(9), 1607 (2024)

[11] Obrenovic, B., Gu, X., Wang, G., Godinic, D., Jakhongirov, I.: Generative ai and human-robot interaction: implications and future agenda for business, society and ethics. AI & SOCIETY, 1-14 (2024)

[12] Apraiz, A., Lasa, G., Mazmela, M.: Evaluation of user experience in human-robot interaction: a systematic literature review. International Journal of Social Robotics 15(2), 187-210 (2023)

[13] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

[14] Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., Garg, A.: Progprompt: program generation for situated robot task planning using large language models. Autonomous Robots 47(8), 999-1012 (2023)

[15] Vemprala, S.H., Bonatti, R., Bucker, A., Kapoor, A.: Chatgpt for robotics: Design principles and model abilities. IEEE Access (2024)

[16] Liu, J., Chai, C., Li, H., Gao, Y., Zhu, X.: Llm-informed drone visual inspection of infrastructure (2024)

[17] Sun, G., Xie, W., Niyato, D., Du, H., Kang, J., Wu, J., Sun, S., Zhang, P.: Generative ai for advanced uav networking. arXiv preprint arXiv:2404.10556 (2024)

[18] Kim, Y.S., Lee, J., Lee, S., Kim, M.: A force reflected exoskeleton-type masterarm for human-robot interaction. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 35(2), 198-212 (2005)

[19] Scheggi, S., Chinello, F., Prattichizzo, D.: Vibrotactile haptic feedback for human-robot interaction in leader-follower tasks. In: Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, pp. 1-4 (2012)
[20] Peppoloni, L., Brizzi, F., Avizzano, C.A., Ruffaldi, E.: Immersive ros-integrated framework for robot teleoperation. In: 2015 IEEE Symposium on 3D User Interfaces (3DUI), pp. 177-178 (2015). IEEE

[21] Tsetserukou, D., Tadakuma, R., Kajimoto, H., Kawakami, N., Tachi, S.: Towards safe human-robot interaction: Joint impedance control of a new teleoperated robot arm. In: RO-MAN 2007 - The 16th IEEE International Symposium on Robot and Human Interactive Communication, pp. 860-865 (2007). IEEE

[22] Su, Y.-P., Chen, X.-Q., Zhou, T., Pretty, C., Chase, G.: Mixed-reality-enhanced human-robot interaction with an imitation-based mapping approach for intuitive teleoperation of a robotic arm-hand system. Applied Sciences 12(9), 4740 (2022)

[23] Fani, S., Ciotti, S., Catalano, M.G., Grioli, G., Tognetti, A., Valenza, G., Ajoudani, A., Bianchi, M.: Simplifying telerobotics: Wearability and teleimpedance improves human-robot interactions in teleoperation. IEEE Robotics & Automation Magazine 25(1), 77-88 (2018)

[24] Lahmeri, M.-A., Ghanem, W.R., Bonfert, C., Schober, R.: Robust trajectory and resource optimization for communication-assisted uav sar sensing. IEEE Open Journal of the Communications Society 5, 3212-3228 (2024)

[25] Arnold, K.P.: The uav ground control station: Types, components, safety, redundancy, and future applications. International Journal of Unmanned Systems Engineering 4(1), 37 (2016)

[26] Çintaş, E., Özyer, B., Şimşek, E.: Vision-based moving uav tracking by another uav on low-cost hardware and a new ground control station. IEEE Access 8, 194601-194611 (2020)

[27] Xiong, T., Liu, F., Liu, H., Ge, J., Li, H., Ding, K., Li, Q.: Multi-drone optimal mission assignment and 3d path planning for disaster rescue. Drones 7(6), 394 (2023)

[28] Wang, R., Yang, Z., Zhao, Z., Tong, X., Hong, Z., Qian, K.: Llm-based robot task planning with exceptional handling for general purpose service robots. arXiv preprint arXiv:2405.15646 (2024)

[29] Mai, J., Chen, J., Qian, G., Elhoseiny, M., Ghanem, B., et al.: Llm as a robotic brain: Unifying egocentric memory and control (2023)

[30] Mower, C.E., Wan, Y., Yu, H., Grosnit, A., Gonzalez-Billandon, J., Zimmer, M., Wang, J., Zhang, X., Zhao, Y., Zhai, A., et al.: Ros-llm: A ros framework for embodied ai with task feedback and structured reasoning. arXiv preprint arXiv:2406.19741 (2024)

[31] Koubaa, A.: Rosgpt: next-generation human-robot interaction with chatgpt and ros. Preprints (2023)

[32] Javaid, S., Fahim, H., He, B., Saeed, N.: Large language models for uavs: Current state and pathways to the future. IEEE Open Journal of Vehicular Technology (2024)

[33] Cui, J., Liu, G., Wang, H., Yu, Y., Yang, J.: Tpml: Task planning for multi-uav system with large language models. In: 2024 IEEE 18th International Conference on Control & Automation (ICCA), pp. 886-891 (2024). IEEE

[34] Lykov, A., Karaf, S., Martynov, M., Serpiva, V., Fedoseev, A., Konenkov, M., Tsetserukou, D.: Flockgpt: Guiding uav flocking with linguistic orchestration. arXiv preprint (2024)

[35] Tazir, M.L., Mancas, M., Dutoit, T.: From words to flight: Integrating openai chatgpt with px4/gazebo for natural language-based drone control.
In: International Workshop on Computer Science and Engineering (2023)

[36] Zhu, Y., Wang, T., Wang, C., Quan, W., Tang, M.: Complexity-driven trust dynamics in human-robot interactions: insights from ai-enhanced collaborative engagements. Applied Sciences 13(24), 12989 (2023)

[37] Sweller, J.: Cognitive load theory. In: Psychology of Learning and Motivation, vol. 55, pp. 37-76. Elsevier (2011)

[38] Verwimp, E., Yang, K., Parisot, S., Hong, L., McDonagh, S., Pérez-Pellitero, E., De Lange, M., Tuytelaars, T.: Clad: A realistic continual learning benchmark for autonomous driving. Neural Networks 161, 659-669 (2023)

[39] Zhang, T., Li, D., Li, Y., Zeng, Z., Zhao, L., Sun, L., Chen, Y., Wei, X., Zhan, Y., Li, L., et al.: Empowering embodied manipulation: A bimanual-mobile robot manipulation dataset for household tasks. arXiv preprint arXiv:2405.18860 (2024)

[40] Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., Zeng, A.: Code as policies: Language model programs for embodied control. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493-9500 (2023). IEEE