
Dual-System Architecture: "System 1" and "System 2"
System 2 in the Helix Vision-Language-Action model is a high-level, internet-pretrained Vision-Language Model (VLM) that handles the robot's complex sensory and linguistic input. It runs at a moderate rate, around 7–9 Hz, which gives it time to integrate broad context into each decision. It takes in images from the robot's multiple cameras and proprioceptive data from its internal sensors, fuses them with natural language commands from human operators, and outputs a single continuous latent vector that captures the semantic intent of the current task, providing a goal-oriented representation for downstream processing.

System 2 is built on a transformer-based architecture with billions of parameters, trained on vast datasets including internet-scale image-text pairs and teleoperation demonstrations, which lets it generalize across a wide variety of tasks and environments. It is responsible for interpreting ambiguous or open-ended instructions such as “set the table for dinner” or “sort these packages by destination,” breaking them down into actionable sub-goals that are passed to System 1 for execution.

System 2 can also maintain short-term and long-term memory, letting the robot track ongoing tasks and adapt its behavior as circumstances change. This is particularly useful in dynamic environments like warehouses or homes, where priorities and obstacles shift rapidly. Its latent vector is designed to be robust to noise and partial observations, so the robot can keep functioning even if some sensors are temporarily occluded or malfunctioning, and its multi-modal fusion lets the robot reason about objects, people, and spaces in a unified way.

Beyond single-robot operation, System 2 enables collaborative behavior, allowing multiple robots to coordinate by sharing goal representations, and its reasoning extends to safety: it can detect hazardous situations and modify plans to avoid risk. The system is optimized for energy efficiency, balancing computational demands against the robot's battery constraints, and regular cloud-based updates keep its knowledge and skills current with advances in AI research. The modular design leaves room for future expansion, so new sensory modalities or reasoning capabilities can be added as needed, and its outputs are interpretable, making it possible for human supervisors to audit decisions and ensure compliance with ethical guidelines. In summary, System 2 provides the high-level intelligence that lets Figure 03 operate autonomously in complex, real-world settings, bridging the gap between human intent and robotic action.
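To make the data flow concrete, here is a minimal, hypothetical sketch of the System 2 interface described above: pre-encoded camera features, proprioceptive readings, and a language-command embedding are fused into one continuous latent vector for System 1 to consume. It is written in PyTorch, and the module names, dimensions, and fusion scheme are illustrative assumptions rather than Figure's published implementation.

```python
# Minimal sketch of the System 2 interface described above.
# All names, dimensions, and the fusion scheme are illustrative assumptions,
# not Figure's published Helix implementation.
import torch
import torch.nn as nn


class System2Sketch(nn.Module):
    """High-level VLM stand-in: fuses vision, proprioception, and language
    into a single continuous latent vector for System 1 to consume."""

    def __init__(self, img_dim=768, proprio_dim=64, text_dim=768, latent_dim=512):
        super().__init__()
        # In the real model these would be pretrained vision/language encoders;
        # here simple linear projections stand in for them.
        self.img_proj = nn.Linear(img_dim, latent_dim)
        self.proprio_proj = nn.Linear(proprio_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.fuse = nn.Sequential(
            nn.Linear(3 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, img_feats, proprio, text_feats):
        # Each input is a pre-encoded feature vector for one observation step.
        fused = torch.cat(
            [self.img_proj(img_feats),
             self.proprio_proj(proprio),
             self.text_proj(text_feats)],
            dim=-1,
        )
        # Single latent vector summarizing the task intent, refreshed at ~7-9 Hz.
        return self.fuse(fused)


# Usage: System 2 runs slowly; its latest latent would be held constant while
# System 1 runs its fast control loop against it.
s2 = System2Sketch()
latent = s2(torch.randn(1, 768), torch.randn(1, 64), torch.randn(1, 768))
print(latent.shape)  # torch.Size([1, 512])
```

In the full system, the slow 7–9 Hz loop would refresh this latent while the low-level controller reads the most recent value, which is what lets the high-level reasoning and the fast motor control run at different rates.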