An important challenge for any cognitive visual system is learning to map pixel-wise input from the eyes or camera(s) onto compositional internal representations that can drive decisions, actions, and memory construction. Deep learning approaches provide models of visual learning, but they rely on increasingly massive datasets. In contrast, human children learn to perceive from far more constrained data, in many cases residing largely within a few rooms of one house for the first year of life. The learning process in children is not completely understood, but unlike many machine-vision approaches, which train on sequences of randomly ordered images, it likely capitalizes on interaction with the world. Our research attempts to develop new algorithms that accomplish the first step of such interactive learning. While moving through an environment, an agent senses the passage of time and its spatial position, which provide metrics of similarity that can be used as a self-supervisory training signal. Such a learning mechanism could be available to a child prior to any ability to understand verbal or social cues, and would also be effective for artificial agents attempting to learn the visual statistics of a new environment.
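The core idea of treating elapsed time as a self-supervisory signal can be sketched minimally. The snippet below is an illustrative assumption, not the algorithm developed in this work: it converts timestamps of successive observations into pairwise similarity targets (frames captured close in time are treated as views of similar content) and scores how well a set of embeddings matches those targets. The function names and the exponential similarity kernel are hypothetical choices made for this sketch.

```python
import numpy as np

def temporal_similarity_targets(timestamps, tau=1.0):
    """Self-supervisory targets from time alone: observations captured
    close together in time are assigned high target similarity.
    tau controls how quickly similarity decays with elapsed time."""
    t = np.asarray(timestamps, dtype=float)
    dt = np.abs(t[:, None] - t[None, :])          # pairwise time gaps
    return np.exp(-dt / tau)                      # in (0, 1], 1 on the diagonal

def embedding_alignment_loss(embeddings, targets):
    """Mean squared gap between cosine similarity of learned embeddings
    and the temporal similarity targets; lower means the embedding
    geometry better reflects temporal proximity."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T                                 # cosine similarity matrix
    return float(np.mean((sim - targets) ** 2))

# Timestamps of three observations: two nearly simultaneous, one much later.
targets = temporal_similarity_targets([0.0, 0.1, 5.0])
```

Because the targets come from timestamps the agent senses for free while moving through the environment, no labels or social cues are required; spatial position could supply an analogous pairwise-distance target.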