
Spectacles - EyeConnect

Introduction

Shared Augmented Reality experiences are part of Spectacles’ mission to connect humans in the real world. In a shared experience, each user can see and interact with the digital content from their own perspective as if it were real.

Two users joining a shared session with EyeConnect.

Traditional methods require all users to scan the environment or fiducials placed in the environment; this can be cumbersome in certain situations. EyeConnect, on the other hand, allows users to join shared experiences seamlessly: two users in the same physical space simply need to look at each other. The algorithm works by tracking nearby Spectacles in the camera feed and comparing the motion data those devices sense against its own tracked movement. A robust optimizer then finds the 6DoF pose (position and orientation) that aligns both virtual worlds, so that shared virtual objects appear accurately in the same physical location for all connected users.

As an example, the figure below illustrates two users, Anna and Joe, joining a session. Anna hosts the experience, which is advertised to all nearby Spectacles via Bluetooth.

Once Joe joins, both users are prompted to look at each other. Simultaneously, EyeConnect starts a Spectacles Tracker on each device that detects Spectacles glasses in the camera feed. Anna's device transmits her 3D head poses and its 2D detections of Joe's Spectacles to Joe's device. Joe's device likewise tracks Anna's Spectacles and records his own 3D head poses.

Our Egomotion Alignment algorithm then processes all this information to determine the optimal alignment between the two users. To minimize waiting time, a preliminary, lower-quality alignment is established after only a few seconds and is continuously refined thereafter.
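
For illustration, the sketch below shows the kind of data that might be exchanged per device during alignment. The class and field names are hypothetical, not the actual Spectacles API, but they reflect the payload described above: timestamped 3D head poses and 2D detections, no images.

    # Hypothetical sketch of the per-frame data exchanged during alignment.
    # Field names and types are illustrative, not the actual Spectacles API.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class HeadPose:
        timestamp: float                               # device steady-clock time in seconds
        position: Tuple[float, float, float]           # 3D head position in the local world frame
        rotation: Tuple[float, float, float, float]    # orientation as a quaternion (x, y, z, w)

    @dataclass
    class SpectaclesDetection:
        timestamp: float                               # capture time of the camera frame
        keypoints: List[Tuple[float, float]]           # 2D keypoints of the other user's Spectacles (pixels)
        camera_id: int                                 # which camera on the device produced the detection

    @dataclass
    class AlignmentPacket:
        """What the host streams to the joining device: only poses and 2D points, no images."""
        head_poses: List[HeadPose] = field(default_factory=list)
        detections: List[SpectaclesDetection] = field(default_factory=list)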

EyeConnect system overview.

EyeConnect is a feature designed with privacy in mind. It only shares tracked device positions and 2D points between devices; no visual information (i.e., no images) is transmitted. All data used for EyeConnect is deleted immediately after the content is aligned. In contrast to systems based on Simultaneous Localization and Mapping (SLAM), EyeConnect doesn't require users to look at the same object or scene; they only need to look at each other. In typical conditions, users can join a shared session in less than 5 seconds.

Traditional Approaches

State-of-the-art approaches leverage a mix of map-based relocalization, visual fiducials, shared visual content, and carefully designed user interfaces to deliver robust, low-latency, and user-friendly experiences in collaborative environments.

Map-Based Relocalization: Devices use pre-built spatial maps that allow new users or devices to visually relocalize to a common physical coordinate frame. The device's live sensor data is matched against this existing map to determine its exact pose. This is a foundational, continuous process that often combines sensor fusion (visual-inertial odometry, visual SLAM) to achieve accurate, low-drift pose estimation (even in GNSS-denied or resource-constrained environments), but it relies on having a detailed map beforehand.

Visual Markers: These remain popular and practical for fast alignment. Solutions like ARToolKit, AprilTag, ArUco, or similar optical markers offer a robust and efficient method for establishing a visual fiducial within a scene. These markers are uniquely designed patterns, often square and featuring a high-contrast binary design, that are readily recognized and accurately localized by device cameras and computer vision algorithms. Their primary function is to provide a precise, real-world coordinate system relative to the camera, making them invaluable tools for various applications, including AR.
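
For comparison, the snippet below sketches how a marker-based approach typically recovers a camera pose relative to a fiducial using OpenCV's ArUco module (4.7+ style API); the dictionary choice, marker size, and camera intrinsics are placeholders.

    # Minimal sketch of marker-based pose estimation with OpenCV's ArUco module.
    # Camera intrinsics and marker size below are placeholders.
    import cv2
    import numpy as np

    MARKER_SIDE = 0.10                               # marker edge length in meters (placeholder)
    camera_matrix = np.array([[600.0, 0.0, 320.0],
                              [0.0, 600.0, 240.0],
                              [0.0,   0.0,   1.0]])  # placeholder intrinsics
    dist_coeffs = np.zeros(5)                        # assume no lens distortion

    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

    def marker_pose(image):
        """Return the pose (rvec, tvec) of the first detected marker, or None."""
        corners, ids, _ = detector.detectMarkers(image)
        if ids is None or len(corners) == 0:
            return None
        # 3D corners of the marker in its own coordinate frame (z = 0 plane).
        half = MARKER_SIDE / 2.0
        object_points = np.array([[-half,  half, 0.0],
                                  [ half,  half, 0.0],
                                  [ half, -half, 0.0],
                                  [-half, -half, 0.0]])
        ok, rvec, tvec = cv2.solvePnP(object_points, corners[0].reshape(4, 2),
                                      camera_matrix, dist_coeffs)
        return (rvec, tvec) if ok else None

Two devices that both see the same marker can express their poses in the marker's frame and thereby share a coordinate system, which is exactly the step EyeConnect replaces.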

Practical Considerations and Challenges

  • Marker-based approaches like AprilTag are simple, cost-effective, and require minimal infrastructure, but they lack widespread adoption in the industry because users first need to manufacture a physical tag.

  • Ensuring persistent, drift-free co-location among multiple users remains challenging, especially in environments with little texture, high dynamics, or occlusion.

  • Map-based and visual-content strategies provide persistent, high-accuracy alignment at scale, but they require prior mapping or robust visual feature matching and texture-rich static scenes, demanding more computation and coordination.

  • Sharing virtual representations of physical environments may raise privacy concerns, especially if the collected sensor data persists beyond an ad-hoc shared session.

Spectacles Tracker

EyeConnect is based on a dynamic-object anchoring approach using the only object that is always present: Spectacles. Being able to accurately, robustly and efficiently detect Spectacles glasses in the camera feed is therefore of utmost importance.

Our custom-designed network is trained to detect keypoints on Spectacles to facilitate pose estimation. While a single, unique 2D keypoint is sufficient for Egomotion Alignment, we introduced five distinct, well-defined keypoints to enable tracking and enhance robustness. Using five keypoints also makes the algorithm more resilient to viewing-angle changes and occlusion.

To ensure optimal utilization of our Digital Signal Processor (DSP), we employ a low-parameter convolutional neural network whose encoder is quantized. This configuration achieves a rapid runtime of just 3 ms on current Spectacles. For training, we gathered and labeled real-world sequences of people interacting with each other.
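
The exact architecture is not public; the sketch below only illustrates the general shape of such a model: a small convolutional encoder (the part that would be quantized) followed by a head that predicts one heatmap per keypoint. The layer sizes, input format, and output resolution are assumptions.

    # Illustrative low-parameter keypoint network (not the actual Spectacles model).
    # Takes a 128x128 crop and predicts one heatmap per keypoint; the argmax of
    # each heatmap gives a 2D keypoint location.
    import torch
    import torch.nn as nn

    NUM_KEYPOINTS = 5  # five well-defined keypoints on the Spectacles frame

    class TinyKeypointNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Small encoder; in practice this part would be quantized for the DSP.
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # One heatmap per keypoint at reduced resolution.
            self.head = nn.Conv2d(64, NUM_KEYPOINTS, 1)

        def forward(self, x):
            return self.head(self.encoder(x))

    # Example: run on a grayscale 128x128 crop and extract keypoint locations.
    crop = torch.zeros(1, 1, 128, 128)
    heatmaps = TinyKeypointNet()(crop)                        # shape (1, 5, 16, 16)
    flat = heatmaps.flatten(2).argmax(dim=2)                  # peak index per heatmap
    keypoints = torch.stack([flat % 16, flat // 16], dim=-1)  # (x, y) in heatmap coordinates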

EyeConnect minimizes power consumption by employing both a detector and a tracker. Initially, Spectacles are detected by running inference on sliding windows at various scales, each downscaled to 128x128 pixels. Following detection, the same network is used to track the Spectacles.
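
A rough sketch of what multi-scale sliding-window detection can look like is shown below, implemented here as an image pyramid with a fixed 128-pixel window, which is equivalent to downscaling windows of various sizes. The scales, stride, and score threshold are illustrative, and score_fn stands in for the neural network.

    # Illustrative multi-scale sliding-window detection (parameters are assumptions).
    import cv2
    import numpy as np

    WINDOW = 128
    SCALES = [1.0, 0.5, 0.25]   # fractions of the full image resolution (assumed)
    STRIDE = 64                 # window step in pixels at each scale (assumed)

    def detect_spectacles(image, score_fn, threshold=0.5):
        """score_fn(crop) -> detection score; returns candidate boxes in full-resolution pixels."""
        detections = []
        for scale in SCALES:
            resized = cv2.resize(image, None, fx=scale, fy=scale)
            h, w = resized.shape[:2]
            for y in range(0, h - WINDOW + 1, STRIDE):
                for x in range(0, w - WINDOW + 1, STRIDE):
                    crop = resized[y:y + WINDOW, x:x + WINDOW]
                    if score_fn(crop) > threshold:
                        # Map the window back to full-resolution coordinates.
                        detections.append((int(x / scale), int(y / scale),
                                           int(WINDOW / scale), int(WINDOW / scale)))
        return detections

Once a detection is found, the tracker only needs to evaluate the region around the last known location, which is far cheaper than scanning the whole frame.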

To accommodate simultaneous multi-user scenarios, the system continues to detect new Spectacles until it is actively tracking a maximum of three Spectacles. This limit is set to ensure sufficient computational resources remain available for other concurrent Computer Vision algorithms.


Spectacles Tracker: The pink bounding box marks tracked Spectacles.

Egomotion Alignment

Egomotion Alignment is the mathematical foundation of EyeConnect. To determine the relative pose (rotation R and translation t) between the coordinate systems of two Spectacles, several inputs are required:

  • 3D Head Motion Poses: Must be sampled at a high frequency (ideally >30Hz).

  • 2D Point Observations: Observations of the other Spectacles (e.g. features on the device) in the camera images.

  • Camera Parameters: Intrinsic and extrinsic parameters for all cameras involved.

The goal is to estimate the unknown rotation (R) and translation (t) between the two coordinate systems.
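
One plausible way to formalize this (the notation u_i, p_i, K, T_cam,i, and π is ours, not necessarily the exact internal formulation): each 2D observation u_i of the other user's Spectacles, taken by a camera with intrinsics K and pose T_cam,i in the observer's world frame, should approximately agree with the other user's reported 3D head position p_i once it is transformed by the unknown (R, t):

    \mathbf{u}_i \;\approx\; \pi\!\left( K \, T_{\mathrm{cam},i}^{-1} \left( R\,\mathbf{p}_i + \mathbf{t} \right) \right)

where π denotes perspective division. Collecting these constraints over time gives the system from which (R, t) can be estimated.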

Two people joining a shared session. Without alignment, AR content would be rendered in a different physical location for each user (left). Having aligned both coordinate origins (O^A and O^J) using EyeConnect (right), both users see virtual objects in the same place.

Estimating these unknowns is inherently a highly non-linear and noise-sensitive problem. However, the problem can be simplified because both coordinate systems are gravity-aligned, which fixes the pitch and roll angles. Consequently, we only need to estimate the yaw angle and the translation. This simplification allows us to formulate the problem as a Quadratic Eigenvalue Problem (QEP), which can be solved provided we have a minimum of four independent observations.
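
The specific constraint matrices are internal to EyeConnect, but as a generic illustration, a QEP of the form (λ²·A2 + λ·A1 + A0)·x = 0 is commonly solved numerically by linearizing it into a generalized eigenvalue problem of twice the size. The matrices below are random placeholders standing in for the alignment constraints.

    # Generic sketch: solving a quadratic eigenvalue problem via companion linearization.
    import numpy as np
    from scipy.linalg import eig

    n = 4
    rng = np.random.default_rng(0)
    A0, A1, A2 = (rng.standard_normal((n, n)) for _ in range(3))

    # Companion linearization of (A2*lam**2 + A1*lam + A0) @ x = 0:
    #   [ 0    I ] [  x   ]         [ I   0  ] [  x   ]
    #   [-A0  -A1] [lam*x ] = lam * [ 0   A2 ] [lam*x ]
    L = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-A0, -A1]])
    M = np.block([[np.eye(n), np.zeros((n, n))],
                  [np.zeros((n, n)), A2]])

    eigvals, eigvecs = eig(L, M)
    real_solutions = eigvals[np.abs(eigvals.imag) < 1e-9].real
    print(real_solutions)  # real eigenvalues are the candidate solutions; eigenvectors carry the remaining unknowns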

Solving the QEP using observations from only one set of Spectacles against the head poses of the other typically yields two reasonable real solutions. To enhance robustness, we introduce a bi-directional constraint.

This is achieved by sending head poses and observations to the "joining" device. The joining device then only establishes the shared session if it finds a solution that successfully utilizes observations from both devices. This bi-directional constraint significantly improves robustness, enabling a reliable solution even with typical noise levels.
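
A hypothetical sketch of such an acceptance test (the thresholds and callback names are assumptions): a candidate alignment is only kept if it explains a minimum number of observations from each device.

    # Hypothetical bi-directional acceptance test for a candidate alignment.
    def accept_alignment(candidate, residuals_host, residuals_joiner,
                         inlier_thresh_px=5.0, min_inliers=4):
        """residuals_* return per-observation reprojection errors (pixels) for the candidate."""
        inliers_host = sum(r < inlier_thresh_px for r in residuals_host(candidate))
        inliers_joiner = sum(r < inlier_thresh_px for r in residuals_joiner(candidate))
        # Require support from BOTH devices' observations before starting the session.
        return inliers_host >= min_inliers and inliers_joiner >= min_inliers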

When there are more than two Spectacles users, we need to associate all 2D point observations with the corresponding 3D head poses. We have found that using RANdom SAmple Consensus (RANSAC) effectively solves this association problem, even in environments with many bystanders.
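
Below is a generic RANSAC loop to illustrate the idea; the fitting and residual functions are placeholders (in EyeConnect, the minimal fit would correspond to solving the QEP from four observations, as described above).

    # Generic RANSAC sketch for the association problem (callbacks are placeholders).
    import random

    def ransac(observations, fit_minimal, residual, min_samples=4,
               inlier_thresh=5.0, iterations=200):
        """Return the model with the most inliers, together with those inliers."""
        if len(observations) < min_samples:
            return None, []
        best_model, best_inliers = None, []
        for _ in range(iterations):
            sample = random.sample(observations, min_samples)
            model = fit_minimal(sample)          # e.g. solve the alignment from 4 observations
            if model is None:
                continue
            inliers = [o for o in observations if residual(model, o) < inlier_thresh]
            if len(inliers) > len(best_inliers):
                best_model, best_inliers = model, inliers
        return best_model, best_inliers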

Since Egomotion Alignment combines motion data from multiple devices, we also need to ensure an accurate common clock for timestamps. While clock domains between Spectacles can initially be synced using protocols like the Network Time Protocol (NTP), we found NTP alone to be insufficiently accurate. Therefore, NTP is only used to provide an initial guess for aligning the devices' steady clocks. Following this initial alignment, and given some initial observations, a grid search is performed to establish a more accurate clock synchronization.
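
As an illustration, a grid search over candidate clock offsets around the NTP estimate might look like the sketch below; the search window, step size, and the alignment_residual callback are assumptions.

    # Illustrative grid search for the clock offset between two devices.
    # alignment_residual(offset) is a placeholder that would shift the remote
    # timestamps by `offset`, re-run the alignment, and return its residual error.
    import numpy as np

    def refine_clock_offset(ntp_offset, alignment_residual,
                            search_window=0.050, step=0.001):
        """Search +/- 50 ms around the NTP estimate in 1 ms steps (assumed values)."""
        candidates = np.arange(ntp_offset - search_window,
                               ntp_offset + search_window + step, step)
        residuals = [alignment_residual(o) for o in candidates]
        return candidates[int(np.argmin(residuals))]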


Results

We tested EyeConnect on many sequences recorded with Spectacles. Each recording contains two camera streams per device at 30Hz. The recordings focus on scenes that are challenging for EyeConnect, such as crowded scenes, little movement, and late or no encounter. For ground-truth pose data, we added AprilTags to the scenes when recording the test sequences.

Data recordings with AprilTags for ground truthing. Tracked Spectacles are indicated by the pink rectangles. Users are connected to a shared experience once the green check mark appears.

EyeConnect uses the number of observations and the number of inliers to estimate the quality of a found solution. Once the quality reaches at least “Low”, we propagate the shared coordinate frame to the system, which then enters the shared experience. In these challenging test scenarios, this first solution had a median error of 15cm. The algorithm keeps tracking and therefore keeps finding better solutions; tracking is stopped once “High” quality is reached. At high quality, the median error within a 5m range was 2.2cm. For 90% of all sequences, the time to first fix was 2.6 seconds or less.

The median spatial error within a 5m distance is below 15cm when the shared experience starts. EyeConnect stops when the error is reduced to 2cm.

Conclusion

EyeConnect allows users to join shared experiences with hardly any friction. 

Apart from typical CV challenges such as low-light scenarios, we identified two scenarios that are specifically challenging for EyeConnect: crowded environments and lack of motion.

In crowded environments, the algorithm faces an association problem, as it needs to associate 2D observations with 3D device trajectories. In the scenario where users don't move, EyeConnect struggles to reduce the localization error, as it can only make use of the rather short baseline of its stereo camera pair; in practice, however, we found this to rarely be a problem.

Despite these challenges, EyeConnect excels at creating seamless and immediate shared experiences, leveraging its technology to quickly connect users in a shared virtual space with minimal setup or delay.