I am trying to use CrossFormer to follow a sequence of images (a topological map, or topomap). Since the paper describes the approach as similar to ViNT and NoMaD, I expected the CrossFormer network to predict distances to the goal/subgoal images. However, I don't see any distance prediction network in the CrossFormer architecture (perhaps I am missing something). Is the assumption, then, that the robot can only imitate a trajectory blindly, from its starting point to some end point, without localizing itself in the topomap?
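
To make the question concrete, here is a minimal sketch of the localization step I was expecting, in the style of ViNT/NoMaD deployment. It assumes a distance head that returns a predicted temporal distance from the current observation to each topomap node. The function name, `radius`, and `close_threshold` are hypothetical illustrations and are not taken from the CrossFormer code.

```python
import numpy as np


def localize_in_topomap(distances, closest_node, radius=2, close_threshold=3.0):
    """Pick the closest topomap node and the next subgoal (illustrative sketch).

    distances: predicted distances from the current observation to every
        topomap node. This is an assumed output of a distance head; a real
        system would likely only query the nodes inside the search window.
    closest_node: index of the node the robot was last localized at.
    radius: how many nodes on either side of closest_node to search.
    close_threshold: distance below which the node counts as reached.
    """
    distances = np.asarray(distances, dtype=float)
    last = len(distances) - 1

    # Restrict the search to a window around the previous estimate.
    start = max(closest_node - radius, 0)
    end = min(closest_node + radius, last)
    window = distances[start:end + 1]

    # Localize: the node with the smallest predicted distance.
    min_idx = int(np.argmin(window))
    closest_node = start + min_idx

    # Advance the subgoal once the closest node is near enough.
    if window[min_idx] < close_threshold:
        subgoal = min(closest_node + 1, last)
    else:
        subgoal = closest_node
    return closest_node, subgoal


if __name__ == "__main__":
    # Made-up distances, purely to exercise the function.
    print(localize_in_topomap([9.0, 6.0, 2.0, 5.0, 8.0], closest_node=1))
```

Without a distance output of this kind, I don't see how the closest node or the next subgoal would be chosen at deployment time.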