I suspect you will find that although you cannot perceive absolute spatial relationships (e.g. exact distances, directions, slopes, degrees of curvature of surfaces, or trajectories of motion), you can nevertheless perceive, with high certainty, many partial orderings and changes in partial orderings.
For example, for many pairs of visible surface features of different objects in the scene, you can tell which of the two features is closer to the viewer, and whether the perceived 2D (projected) distance between two visible features is increasing or decreasing as the video progresses: e.g. the perceived size of the gap between two bars in the back of the chair, or the perceived projected distance between part of the top edge of the chair and an edge of a floor tile, or the bottom edge of one of the cupboards.
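To make the idea of perceiving partial orderings rather than absolute values concrete, here is a minimal sketch (in Python) of a data-structure that records only depth orderings and the direction of change of projected gaps between features. All class, function and feature names are invented for illustration; this is not a claim about how such information is actually encoded.

```python
# Hypothetical sketch: recording partial orderings of depth and trends in
# projected (2D) distances between visible features, with no absolute metrics.
from enum import Enum

class Trend(Enum):
    INCREASING = 1
    DECREASING = 2
    UNCHANGED = 3

class QualitativeSceneRecord:
    def __init__(self):
        self.closer_than = set()   # pairs (a, b): feature a is closer to the viewer than b
        self.gap_trend = {}        # pairs (a, b) -> Trend of the projected distance between them

    def note_depth_order(self, nearer, farther):
        self.closer_than.add((nearer, farther))

    def note_gap_change(self, a, b, previous_gap, current_gap, tolerance=1e-3):
        # Only the sign of the change is kept: a qualitative trend over time,
        # not a metrical estimate of the distance itself.
        if current_gap > previous_gap + tolerance:
            self.gap_trend[(a, b)] = Trend.INCREASING
        elif current_gap < previous_gap - tolerance:
            self.gap_trend[(a, b)] = Trend.DECREASING
        else:
            self.gap_trend[(a, b)] = Trend.UNCHANGED

# Example: the gap between two bars in the chair back appears to widen.
record = QualitativeSceneRecord()
record.note_depth_order("chair_back_bar_left", "cupboard_edge")
record.note_gap_change("chair_back_bar_left", "chair_back_bar_right", 12.0, 14.5)
```

The point of the sketch is that only comparisons and signs of change are stored; no coordinates, distances or angles are retained.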
I suggest that both the richness of human visual experience and the enormously skillful use of visual information by fast-moving animals in complex environments (e.g. squirrels moving through tree branches to reach food or another tree, and birds moving rapidly towards or away from a nest during nest construction or when feeding fledglings) indicate that a great deal of spatial information about structures and changes in the visual field, and about their static and changing relationships, is used (with the help of multiple constraint-propagation processes) to assemble the mass of detailed information required for a variety of control decisions on different time scales.
What are the likely causes of the changes in sensed information? Insofar as there are specific changes, e.g. in relative size, relative distance, or relative orientation of objects or object parts, do those visible changes provide information about what exists or is happening in the "world"?
What difference would it make if, in addition to the visual sensory information available in the videos, you also had information (proprioceptive information and efferent feedback) about the motor signals a brain might send to limbs, fingers, etc. to produce the changes in viewpoint? What might you be able to perceive (infer, learn) from the visual information that you could not without that information? Compare Berthoz (2000).
E.g. are there changes in relationships between objects in 3D space (e.g. contact, direction, distance, relative orientation, obstructing line of sight) that help to provide useful information about what objects there are, what parts they are composed of, and what the spatial relationships between various parts and various objects are?
What sorts of AI reasoning mechanisms could make use of all those sources of sensory information, possibly combined with records of motor signals ("efference copies" is the misleading jargon), in order to acquire information relevant to control decisions about possible actions on the perceived objects, e.g. biting, grasping, pushing, moving, lifting, stacking, avoiding, etc.?
Readers are invited to look for examples of these points in the videos, and to consider what difference it would make if in addition to the visual input signals you also had information about motor output signals.
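One way to see what the extra (efferent) information could add: if a copy of the motor command allows the expected image motion to be predicted, then whatever image motion is left unpredicted is evidence about changes in the world rather than changes of viewpoint. The following is a minimal sketch under that assumption; the function name and the numbers are invented for illustration only.

```python
# Hypothetical sketch: using an efference copy to separate self-caused image
# motion from externally caused motion.
#   observed_shift:  image shift actually measured for a feature (pixels)
#   predicted_shift: image shift expected from the issued motor command (pixels)

def externally_caused_motion(observed_shift, predicted_shift):
    """Image motion not accounted for by the viewer's own movement is
    attributed to motion or depth structure in the world."""
    return observed_shift - predicted_shift

# A camera pan is expected to shift the whole image 5 pixels left (-5.0);
# a feature is observed to shift only 2 pixels left (-2.0), so the residual
# +3.0 suggests the feature itself moved, or lies at an unexpected depth.
print(externally_caused_motion(-2.0, -5.0))   # 3.0
```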
Seeing a single chair, plus background, from a moving viewpoint (MP4) (1 min 43 secs)

Some questions about the videos:

Think about how what you see in the background changes as the viewpoint changes, and how visual relationships between parts of the chair and parts of the background change. Many details change in coordinated ways that give strong clues as to the 3D spatial layout of the perceived scene, as well as providing information about affordances, as suggested by Gibson. A simple example noted by Gibson is that when moving towards a surface, the centre of expansion of visible texture in the visual image and the rate of expansion of texture give information about the likely point of contact with the surface, and the time to contact. We can generalise this point to include different directions, speeds, and centres of optical flow occurring simultaneously when viewing a scene with multiple surfaces. Moreover, relationships between patterns of optical flow can give evidence as to whether two visible surface patches are parts of the same flat surface (e.g. floor, wall, table-top, cupboard door, etc.). Gibson's ideas can be combined with those of Trehub (1991), according to which the primary visual cortex is mainly an optical feature capture device, whose results are immediately copied to other brain centres for processing according to different needs. (This would explain why the blind spot is not perceived: there is nothing there to be copied to other cognitive sub-systems.)
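Gibson's time-to-contact point can be illustrated with a small sketch: the ratio of a feature's projected size to its rate of expansion (often called "tau") estimates time to contact without any absolute distances or speeds. The function and the numbers below are illustrative assumptions, not part of the videos.

```python
# Hypothetical sketch of Gibson-style time-to-contact estimation.
# tau = theta / (d theta / dt): projected size divided by its rate of expansion.
# Only image measurements are used; no absolute distance or speed is needed.

def time_to_contact(size_prev, size_now, dt):
    """Estimate seconds until contact with the surface whose projected
    size grew from size_prev to size_now over dt seconds."""
    expansion_rate = (size_now - size_prev) / dt
    if expansion_rate <= 0:
        return float("inf")   # not approaching (static or receding)
    return size_now / expansion_rate

# Example: a surface's projected width grows from 80 to 84 pixels in 0.1 s,
# so contact is expected in roughly 84 / 40 = 2.1 seconds.
print(time_to_contact(80.0, 84.0, 0.1))
```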
Seeing two chairs, plus background, from a moving viewpoint (MP4) (1 min 17 secs)

Think about how the presence of a second chair affects the patterns of change in visual input related to motion of the viewer. What additional problems, and what additional opportunities, does the added 3D complexity bring to the visual task?

Seeing a chair and a pot plant, plus background, from a moving viewpoint (MP4) (47 secs)

The added complexity produced by the second chair and by the pot plant differs in kind, in the number of changing features, and in the variety of changes of visibility of surfaces, edges, textures, etc. Is it possible that the extra spatial complexity in the structures and processes enriches what an intelligent perceiver can perceive, without the perceiver having to be trained on the different objects, configurations of objects and processes?
Some questions relevant to specifying the requirements to be met:
Which surfaces and parts of surfaces are visible at any time in these videos?
How are they seen? As collections of 3D or 2D points? As planar surface fragments stitched together? As curved 3D surfaces? As moving surfaces? Moving in which directions relative to what? As processes in which there are surface-like features?
Which answers should be regarded as parts of specifications to be met by designs for intelligent robots with human-like perception and action capabilities?
How are the contents of the perception processes implemented:
(a) spatially, e.g. as absolute positions, orientations, etc., or merely as partial orderings of distance, size, curvature, etc.?
(b) in terms of occlusion or partial occlusion?
(c) in terms of function (e.g. supporting, leaning on, constraining, ...)?
What movements are made by the camera (eyes) during the video? How would the eye movements be represented in the control subsystems? Changes of location, changes of orientation, changes of "fixation centre"? Which of the changes should a robot visual system be able to detect and represent?
How do visibility changes depend on relative locations and relative motions of objects seen and the camera?
At any stage, select a portion of a visible surface and consider whether and how its visibility would be changed by various changes of viewpoint (camera translation -- forwards, backwards, sideways, at some other angle relative to the line of sight, ...?).
How should an intelligent perceiver control viewing direction (rotation of the camera)? When watching videos like these you cannot retrospectively change the viewing direction of the camera, though you can
-- change the viewing direction of your eyes, fixating different image fragments,
-- change the portions and aspects of the image and the scene attended to (how would change of attention be implemented in a robot?),
-- notice and attend to changes of the camera's viewing direction (camera rotation).
Does an intelligent machine, or animal (or normal human) need to be able to estimate absolute distances, sizes, speeds, orientations, rates of rotation?
What alternatives to absolute metrics would be useful for an intelligent agent, and how?
What sorts of internal data-structure could enable a robot (or a brain) to represent
-- what is visible at any time (structures and processes)?
-- the changes that occur?
-- how various movements alter what it is possible, or impossible, to see?
Try the following experiments at various stages in the videos:
Stop the video and select two locations in the scene where a surface is partly visible and partly occluded (because an occluding edge is present). Then consider
-- which motions of the chair, or of the camera/eye location, if any, will simultaneously make MORE of both surface fragments visible,
-- which motions will simultaneously make LESS of both surface fragments visible.
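As a toy illustration of the kind of reasoning this experiment calls for, here is a minimal qualitative model (in Python) of how sideways camera motion changes the visibility of a partly occluded surface. The relation names and the simple left/right rule are assumptions made for illustration only, not a claim about how brains represent occlusion.

```python
# Hypothetical toy model of qualitative reasoning about occlusion and viewpoint
# change. Rule encoded: if a farther surface disappears behind a nearer
# occluder's LEFT edge, sideways camera motion to the LEFT reveals more of it
# (the nearer occluder shifts right in the image relative to the farther
# surface), while motion to the RIGHT hides more of it; and vice versa.
from dataclasses import dataclass

LEFT, RIGHT = "left", "right"

@dataclass
class PartialOcclusion:
    occluder: str        # nearer surface
    occluded: str        # farther surface, partly hidden
    occluding_edge: str  # side of the occluder behind which the hidden part lies

def visibility_change(occlusion: PartialOcclusion, camera_motion: str) -> str:
    """Predict qualitatively whether sideways camera motion reveals or hides
    more of the occluded surface. No distances or angles are used."""
    if camera_motion == occlusion.occluding_edge:
        return "more visible"
    return "less visible"

# Part of a floor tile is hidden behind the left edge of the chair seat:
occ = PartialOcclusion(occluder="chair_seat", occluded="floor_tile",
                       occluding_edge=LEFT)
print(visibility_change(occ, LEFT))   # more visible
print(visibility_change(occ, RIGHT))  # less visible
```

Given two such occlusion records, the experiment's question becomes whether there is a single motion direction for which both predictions are "more visible".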
What can you conclude about differences in spatial consciousness between a normally sighted person and a blind person?
(Note: humans blind from birth may still have access to brain mechanisms that could only have evolved in ancestors with visual capabilities.)
Are there important questions relating to the nature and functions of spatial perception that I've left out?
I suspect answering questions like those above would be more useful for an intelligent machine than computing actual locations in a coordinate space. One reason is that qualitative changes in those features can be used for controlling movements, e.g. steering towards a doorway, without having to compute 3D locations, directions, or distances. (Compare Gibson, 1979.)
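As a hedged sketch of what "steering without computing 3D locations" might look like, the following visual-servoing fragment turns an agent toward a doorway using only the image-plane positions of the doorway's two edges. The function name, parameters and gain are illustrative assumptions, not part of the text above.

```python
# Hypothetical sketch: steering toward a doorway using only image-plane
# quantities (horizontal positions of the doorway's edges), with no 3D
# reconstruction of positions, directions or distances.

def steering_command(left_edge_x, right_edge_x, image_centre_x, gain=0.01):
    """Return a turn rate: positive = turn right, negative = turn left.
    The doorway's image midpoint is servoed onto the image centre."""
    doorway_midpoint = 0.5 * (left_edge_x + right_edge_x)
    error = doorway_midpoint - image_centre_x   # pixels; only the sign and rough size matter
    return gain * error

# Doorway edges seen at x = 200 and x = 360 in a 640-pixel-wide image:
print(steering_command(200.0, 360.0, 320.0))   # -0.4: turn slightly left
```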
Conjecture
Perceptual mechanisms of humans and other intelligent animals (including many mammals and birds) include abilities to reason about how physical changes will affect the availability of information.
Such abilities could be labelled abilities to detect (and use) cognitive affordances.
In reflex responses to triggering stimuli the processing happens without any consideration of alternative options. In more intelligent responses, animals or robots may exhibit varying levels of sophistication and self-awareness in choosing how to process the new information.
Origins of ancient mathematics
The ability to perform a variety of processes of reasoning about possible and impossible changes in visibility (epistemic affordances), including reasoning about possibilities and impossibilities in novel configurations of objects, was an important evolutionary and developmental precursor to the abilities to make ancient mathematical discoveries in geometry and topology.
I suspect there is no mechanism known to neuroscientists that explains such abilities.
The polyflap domain was proposed in 2005 as a domain in which robotic and psychological experiments could be performed to investigate these mechanisms:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/polyflaps
Humans do not always have to be trained on particular configurations of information in order to reason about them. Presumably some of the mechanisms are specified in the genome, though they may not all be expressed at birth.
(For further discussion of the mechanisms required in the genome in order to support human-like mathematical capabilities, see The Meta-Configured Genome:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/meta-configured-genome.html
also available as PDF.)
There are many unanswered questions:
-- How do brains do such reasoning?
-- What brain mechanisms make that possible?
-- How are these abilities related to abilities to make discoveries in topology and geometry?
-- Is it possible to implement mechanisms with those powers using digital computing machinery?
-- What alternatives are there?
See this brief discussion of Turing's distinction between the roles of intuition and ingenuity in mathematical cognition:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/turing-quotes.html
What kinds of geometric/topological insights are used in perceiving and understanding the above videos?
Is it possible that replicating those abilities in robots will require the use of new kinds of computing machinery, combining discrete and continuous changes? For an incomplete discussion see:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/super-turing-geom.html
How would use of a Kinect sensor alter a machine's ability to answer the questions above?
Is it possible that biological evolution made far more use of structures and processes in 2D projections of 3D scenes than of the 3D information available through stereo mechanisms, which require the visual fields of the two eyes to overlap?
What are the relative advantages of each for a mobile viewer perceiving and acting in a 3D environment?
References

James J. Gibson (1979). The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, MA.

Aaron Sloman (2007-2018). Predicting Affordance Changes (Alternative ways to deal with uncertainty). Unpublished technical report, School of Computer Science, University of Birmingham.
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/changing-affordances.html

Aaron Sloman (2012 ff.). The Meta-Morphogenesis Project.
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/meta-morphogenesis.html

Arnold Trehub (1991). The Cognitive Brain. MIT Press, Cambridge, MA.
http://people.umass.edu/trehub/
Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham