From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs


The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real-world tasks, pushing research beyond the boundaries of text toward multimodal contexts and leading to Multimodal Large Language Models (MLMs). Given the current adoption of LLM-based assistants for technical or domain-specific problems, the natural continuation of this trend is to extend the input domains of these assistants by exploiting MLMs. Ideally, such MLMs would serve as real-time assistants in procedural tasks, integrating a view of the environment the assisted user is in, or, better still, sharing the user's point of view via Virtual Reality (VR) or Augmented Reality (AR) supports, so as to reason over the same scenario the user is experiencing. With this work, we aim to evaluate how well currently openly available MLMs can provide this kind of assistance on technical tasks. To this end, we annotated a dataset of furniture assembly with step-by-step labels and manual references: the Manual-to-Action Dataset (M2AD). We used this dataset to assess (1) to what extent the reasoning abilities of MLMs can reduce the need for detailed labelling, allowing for more efficient, cost-effective annotation practices, (2) whether MLMs are able to track the progression of assembly steps, and (3) whether MLMs can correctly refer to the instruction manual pages. Our results show that while some models understand procedural sequences, their performance is limited by architectural and hardware constraints, highlighting the need for multi-image and interleaved text-image reasoning.


💡 Research Summary

The paper “From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs” introduces the Manual‑to‑Action Dataset (M2AD), a new benchmark designed to test the capabilities of multimodal large language models (MLMs) in realistic procedural assistance scenarios. The authors argue that while large language models (LLMs) have achieved impressive text‑only performance, the next frontier is multimodal assistance—especially in technical domains where a user may benefit from a model that can see the same visual scene (via a camera, AR headset, or VR environment) and consult an instruction manual in real time. Existing benchmarks (e.g., IKEA Manuals at Work, ENIGMA‑51, HoloAssist) focus on isolated skills such as action recognition, error detection, or simple video‑text alignment, and they lack the depth needed to evaluate end‑to‑end procedural reasoning, step tracking, and manual referencing.

Dataset Construction
M2AD consists of 53 distinct IKEA furniture items, each represented by a publicly available YouTube assembly video and the corresponding PDF manual scraped from IKEA’s website. The authors manually aligned each video segment with a specific step in the manual, producing 1,228 step‑level annotations. Each annotation records: (i) start and end frame numbers, (ii) the step number as listed in the manual, and (iii) the page number of the manual where the step is described. Statistics show an average of 23.2 steps per video, with step durations averaging 21.1 seconds (σ = 20.5 s) and total assembly times averaging 652.5 seconds. The dataset captures realistic variability: some items have as few as 2 steps, others up to 71; inter‑step gaps average 7.3 seconds. Importantly, the authors also analyze non‑consecutive step transitions (skipping forward or revisiting previous steps), visualizing forward jumps in warm colors and backward regressions in cool colors. This analysis reveals that early assembly phases are more flexible, reflecting user expertise and confidence.
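The annotation fields described above (start/end frame, manual step number, manual page) can be sketched as a simple record type. This is an illustrative schema of our own, not the dataset's actual file format, and the field names, example values, and 30 fps frame rate are assumptions:

```python
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    """Hypothetical record mirroring one M2AD step-level annotation."""
    video_id: str
    start_frame: int
    end_frame: int
    step_number: int   # step as listed in the manual
    manual_page: int   # page of the manual describing the step

    def duration_seconds(self, fps: float = 30.0) -> float:
        """Step duration derived from the annotated frame span."""
        return (self.end_frame - self.start_frame) / fps

# A 630-frame step at 30 fps lasts 21.0 s, near the reported 21.1 s average.
ann = StepAnnotation("wardrobe_01", start_frame=300, end_frame=930,
                     step_number=4, manual_page=7)
print(ann.duration_seconds())  # → 21.0
```

Keeping the frame span rather than raw timestamps lets the same annotation be reused at any sampling rate.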

Evaluation Tasks
To assess open‑source MLMs that can run on consumer‑grade hardware, the authors define three tasks:

  1. Progress Tracking – predict how many steps have been completed up to a given video timestamp.
  2. Video‑Manual Mapping – identify the exact manual page that corresponds to a given video clip.
  3. Current Step Identification – output the step number being performed in real time.
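Given step-level annotations, the ground truth for all three tasks can be read off at any frame. The sketch below shows one way to do this, assuming annotations are tuples of (start_frame, end_frame, step_number, manual_page) as described in the dataset section; the helper name and tuple layout are our own, not the paper's:

```python
def task_labels(annotations, frame):
    """Ground truth at `frame`: (steps_completed, current_step, manual_page).

    steps_completed -> Task 1 (Progress Tracking)
    manual_page     -> Task 2 (Video-Manual Mapping)
    current_step    -> Task 3 (Current Step Identification)
    """
    completed = sum(1 for s, e, _, _ in annotations if e <= frame)
    current_step = manual_page = None
    for s, e, step, page in annotations:
        if s <= frame < e:          # the frame falls inside this step
            current_step, manual_page = step, page
            break
    return completed, current_step, manual_page

anns = [(0, 300, 1, 4), (350, 700, 2, 5), (720, 1100, 3, 5)]
print(task_labels(anns, 500))  # → (1, 2, 5)
```

Note that inside an inter-step gap (e.g. frame 310 above) the current step and page are None, which is exactly the ambiguity a real-time assistant must cope with.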

Five publicly available models were evaluated (LLaVA‑3.2 Vision, Fuse, LLaVA, MolMo, Ovis), each adapted to accept either a single frame or a short image sequence due to GPU memory constraints. The best model achieved roughly 62 % top‑1 accuracy on the mapping task, 58 % on progress tracking, and a 35 % error rate on current‑step identification. Performance degraded sharply when videos involved non‑linear step transitions, indicating that current architectures struggle with long‑range temporal dependencies and with handling multiple visual inputs simultaneously.

Analysis of Limitations
The authors identify three primary bottlenecks:

  • Multi‑image handling: Most MLMs process only one frame or a few sampled frames, which prevents them from building a coherent representation of the ongoing assembly.
  • Shallow cross‑modal attention: Token‑level or feature‑level fusion layers are often shallow, limiting the model’s ability to relate visual cues across time to textual instructions.
  • Hardware constraints: Consumer GPUs cannot hold high‑resolution video streams, forcing aggressive down‑sampling and frame selection that discards crucial context.

These constraints explain why models can recognize isolated actions but fail to maintain a coherent procedural narrative, especially when users skip steps or backtrack.

Future Directions
The paper proposes several research avenues:

  • Develop video‑transformer encoders that can ingest longer frame sequences and feed richer visual embeddings into the language model.
  • Employ parameter‑efficient fine‑tuning techniques such as LoRA to adapt large vision‑language models without exceeding memory budgets.
  • Design streaming architectures capable of processing frames in real time while dynamically querying the manual (e.g., “show me page 5 now”).
  • Enrich M2AD with finer‑grained action labels (e.g., “tighten screw #3”) to enable hierarchical reasoning from low‑level motor actions up to high‑level steps.
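The streaming direction above can be sketched as a small skeleton: frames arrive one at a time, the assistant keeps a running estimate of the current step, and manual-page queries are answered on demand. The `classify_step` callback stands in for an MLM, and every name here is a hypothetical illustration of the idea, not the paper's design:

```python
class StreamingAssistant:
    """Toy skeleton of a real-time, manual-querying assembly assistant."""

    def __init__(self, classify_step, step_to_page):
        self.classify_step = classify_step  # frame -> step number (the model)
        self.step_to_page = step_to_page    # manual index: step -> page
        self.current_step = None

    def observe(self, frame):
        """Update internal state from one incoming frame."""
        step = self.classify_step(frame)
        if step is not None:
            self.current_step = step

    def current_page(self):
        """Answer a query like 'show me the relevant manual page now'."""
        if self.current_step is None:
            return None
        return self.step_to_page.get(self.current_step)

# Dummy stand-in model: frames 3 onward are labelled as step 2.
assistant = StreamingAssistant(
    classify_step=lambda f: 2 if f >= 3 else 1,
    step_to_page={1: 4, 2: 5},
)
for f in range(5):
    assistant.observe(f)
print(assistant.current_page())  # → 5
```

A real system would replace the callback with an MLM conditioned on both the frame stream and the manual pages, but the control flow, incremental observation plus on-demand retrieval, is the same.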

Conclusion
M2AD fills a critical gap in multimodal evaluation by providing a realistic, step‑aligned video‑manual dataset that stresses both visual understanding and procedural reasoning. The benchmark demonstrates that current open‑source MLMs, while promising, are limited by their handling of multi‑image inputs, shallow cross‑modal attention, and hardware constraints. Advancing beyond these limits will be essential for building truly helpful AR/VR assistants that can guide users through complex technical tasks in real time.

