MULTI-LEVEL VISION: POSSIBLE EMPIRICAL TESTS

Aaron Sloman
School of Computer Science, Univ. of Birmingham, UK
http://www.cs.bham.ac.uk/~axs

Last updated: 5 May 2006; 5 Feb 2014; 9 Sep 2017 (format+additions); 9 Aug 2018

This paper is
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/multi-vision.html
A PDF version may be added later.

A partial index of discussion notes in this directory is in
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/AREADME.html

Background

Over many years I have been arguing that the complexity and diversity of the functions of vision have been underestimated, both by vision researchers (in AI, psychology, neuroscience, and philosophy) and especially by AI researchers who assume that a vision component, or more generally a perceptual component, can be a simple front-end to an architecture containing all the interesting bits. A clue to that mindset is an architecture diagram in which perception is restricted to a small box from which arrows emanate to other parts of the system, in contrast with the H-CogAff architecture schema, which has multi-layered perception and action systems with bi-directional links from different layers in the perception and action subsystems to more central multi-layered subsystems.
For examples see
Chapter 9 of The Computer Revolution in Philosophy (1978)
Image interpretation: The way ahead? (1982)
On designing a visual system: Towards a Gibsonian computational model of vision. (1989)
What are the purposes of vision?
Evolvable Biologically Plausible Visual Architectures (2001)
Virtual Machine Functionalism (VMF)
Perception of impossibility
Predicting Affordance Changes
     (Steps towards knowledge-based visual servoing)
Various presentations on architectures and on vision
Around 2005, as a result of working on requirements for ontologies and representations to be used by a robot manipulating 3-D objects, I began to realise that things were even more complex than I had been claiming, because the analysis showed a need for a visual system to be able to represent multiple concurrent processes at different levels of abstraction in partial registration with the optic array.
Early versions of these ideas were presented in
COSY-PR-0505: A (Possibly) New Theory of Vision (PDF)
COSY-PR-0506: Two views of child as scientist: Humean and Kantian (PDF)
COSY-DP-0601: Orthogonal Recombinable Competences Acquired by Altricial Species (Blankets, string, and plywood) (HTML)

Possible experimental tests

In an email discussion of some of the issues, Alain Berthoz asked me to suggest experiments that might be done to test some of the ideas, instead of relying so much on analysis of design requirements. What follows is my first-draft response (written 4 May 2006, but likely to be updated in the light of further thoughts and critical comments received).
It may be that the only suggestions I can make are too shallow. Here are some first thoughts.


Ebbinghaus Illusion
One experiment that has already been done concerns the Ebbinghaus illusion.
Summarised and illustrated on this web site by Holly Gerard.
People make incorrect verbal judgements about the relative sizes of circles surrounded by other circles. Those judgements conflict with measurements of the anticipatory hand movements made when subjects are about to grasp the central circular object: the grasping movements are more accurate.

That suggests that different information is being used, or different means of processing (or both), for different functions: the first function is formulating and expressing verbal judgements about relative sizes; the second is controlling the grasping action.

It is predictable that the second function, control of action, will require more detailed and precise information, though perhaps the information exists only in a transient state within a control system for grasping -- or more generally for on-line control of actions, including kicking, avoiding, pushing out of the way, hitting with a held stick, using as a target for thrown objects, etc.

If these different kinds of representation are created and used in different parts of the brain, there is no a priori reason why the processing should not be done in parallel.

Different kinds of information used in different ways in different sorts of biological organisms are discussed in section 3 of this unfinished paper What the brain's mind tells the mind's eye. (PDF)

I wonder whether there are variants of the Ebbinghaus test involving moving objects, e.g. circles changing size, or groups of circles rotating or moving, while people both look at them and say what is going on, reporting relative sizes (or speeds?), and simultaneously perform actions related to grasping, e.g. using the computer to control the width of a pair of parallel line segments that, if continued, would be tangential to a circle under observation. For example, a joystick with a compressible knob might be used to move the two lines, the position of the joystick controlling the location of the two lines and the amount of squeezing pressure on the knob controlling the gap between them.

Of course, the latter is not the same task as controlling actual grasping, but my guess is that it might use similar visual servoing, employing a relatively precise metrical representation of the spatial discrepancy between the boundaries of the object to be grasped and the grasping surfaces as part of the control mechanism. Describing the global relations between groups of objects, in contrast, involves a relatively high-level, abstract representation of categories of shapes and topological relations, augmented by descriptions of relative sizes and angles of what is in the scene, without involving any action-control parameters.
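To make the suggestion concrete, here is a minimal sketch (in Python) of one way the hypothetical joystick task might be implemented: joystick position sets where the pair of lines is drawn, knob pressure sets the gap between them, and the momentary discrepancy between the gap and the diameter of the observed circle is the quantity a visual servoing loop would be trying to drive towards zero. The function names and numerical scales are invented for illustration; this is one possible operationalisation, not a description of any existing apparatus.

    # Sketch of the hypothetical joystick-controlled 'grasp lines' task.
    # All names and scales here are illustrative assumptions, not real APIs.

    def update_grasp_lines(joystick_xy, knob_pressure, max_gap=200.0):
        """Map joystick position to the midpoint of the two parallel lines,
        and knob pressure (0..1, harder squeeze = narrower gap) to their separation."""
        centre_x, centre_y = joystick_xy
        gap = max_gap * (1.0 - knob_pressure)
        return (centre_x - gap / 2.0, centre_x + gap / 2.0, centre_y, gap)

    def servo_error(gap, circle_diameter):
        """Signed discrepancy between the gap and the target circle's diameter:
        the error a visual servoing loop would be driving towards zero."""
        return gap - circle_diameter

    # Example: 60% squeeze while the observed circle has diameter 80 pixels
    left_x, right_x, y, gap = update_grasp_lines((320.0, 240.0), 0.6)
    print(gap, servo_error(gap, 80.0))

Recording servo_error over time, for surrounded and unsurrounded circles, would give a continuous analogue of the grip-aperture measures used in the grasping studies.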


Example inspired by hunting mammals
A more complex example could be related to a hunting mammal and a herd of prey animals: e.g. a lion and a herd of deer.

The lion has two different though related tasks. One is to decide which animal to chase, which may involve lying in the grass and watching the spatial layout and relative sizes and speeds of the animals in order to select one as a target. In order to make that a computationally feasible task that can be done fairly quickly, the brain must operate at a high level of abstraction, ignoring much of the precise visible detail of size, shape, orientation, and detailed activities of each animal.

In particular the relative distances and directions in the retinal image are not the required relative distances and directions between individual deer, trees, bushes, etc.

A sophisticated visual system might be able to compute a map-like viewpoint-independent representation of relative distances, since the viewpoint-dependent representation (as provided on the retina) does not give the most useful information concerning relative distances, though it does give relative directions.
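The geometric point can be illustrated with a toy calculation: given retinal directions plus estimated ranges (however the ranges are obtained), observer-centred (bearing, distance) readings can be turned into a map-like layout from which the distances between the animals themselves can be read off, something the retinal directions alone do not provide. The following fragment is only a schematic illustration of that point, assuming a flat ground plane and known ranges.

    import math

    def egocentric_to_map(observations):
        """Convert observer-centred (bearing in degrees, range) pairs into
        2-D map coordinates with the observer at the origin."""
        return [(r * math.cos(math.radians(b)), r * math.sin(math.radians(b)))
                for b, r in observations]

    def separation(p, q):
        """Relative distance between two items, not directly given by
        retinal directions alone."""
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Two deer at nearly the same retinal direction but very different ranges:
    deer = egocentric_to_map([(10.0, 30.0), (12.0, 80.0)])
    print(round(separation(deer[0], deer[1]), 1))   # about 50m apart, though visually adjacent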

In contrast, chasing the selected target could make heavy use of 2-D relationships in the visual field (like a sailor using the alignment of two landmarks to steer a straight course into a harbour). As the target is approached, more and more precise metrical information of different kinds is required to prepare either for a leap to bite the victim or for a swipe at the victim's hind legs with a paw: both very complex actions requiring sophisticated control of timing, spatial location, force, etc.
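The sailor's trick mentioned in passing can be stated as a very simple 2-D control rule that never computes a 3-D map: keep the two landmarks (or the target and a reference mark) visually aligned, and turn in proportion to any misalignment. The fragment below is a schematic sketch of such a rule, not a claim about feline motor control.

    def alignment_steering(bearing_near, bearing_far, gain=0.5):
        """Proportional steering rule using only directions in the visual field:
        if the nearer landmark has drifted to the right of the farther one,
        return a positive (rightward) turn-rate command, and vice versa.
        Bearings are in degrees within the visual field, not world coordinates."""
        return gain * (bearing_near - bearing_far)

    print(alignment_steering(4.0, 1.5))   # nearer mark has drifted right: turn right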

As stated above, if these different kinds of representations are created and used in different parts of the brain, there is no a priori reason why the processing should not be done in parallel, though there may be constraints that prevent this, e.g. in the heat of the chase, not least because when hunter and hunted are close the more global visual information previously used is no longer available.

I wonder if some sort of experimental setup could be created in which a person watches a computer screen displaying red and blue moving squares, each with a label (e.g. a letter), where some squares disappear and new ones appear, and a triangle that the subject can control with a joystick.

The subject has to perform two tasks in parallel: (a) commenting on some changing global relationships, e.g. how many blue squares there are, which one is closest to the triangle, which blue square is closest along an unobstructed route (i.e. one not blocked by red squares), and perhaps reporting whenever three or more of the blue squares are collinear; and (b) trying to move the triangle to 'catch' a blue square, which requires both path planning and plan execution, without necessarily describing what is going on.
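As a rough sketch of the geometry behind task (a), the fragment below shows how the experimental software might test whether three blue squares are collinear and which blue square is closest along a straight route not blocked by a red square. Each square is reduced to a labelled point and each red square to a disc-shaped obstacle; this is only one possible operationalisation of the task description, with all thresholds invented for illustration.

    import itertools, math

    def collinear(p, q, r, tolerance=1.0):
        """True if three points are (nearly) collinear: twice the area of the
        triangle they form is close to zero."""
        area2 = abs((q[0]-p[0])*(r[1]-p[1]) - (r[0]-p[0])*(q[1]-p[1]))
        return area2 < tolerance

    def any_collinear_triple(points, tolerance=1.0):
        """Task (a) check: are any three of the blue squares collinear?"""
        return any(collinear(p, q, r, tolerance)
                   for p, q, r in itertools.combinations(points, 3))

    def segment_blocked(a, b, obstacle, radius):
        """Crude obstruction test: is the obstacle centre within 'radius'
        of the straight segment from a to b?"""
        ax, ay, bx, by, ox, oy = *a, *b, *obstacle
        dx, dy = bx - ax, by - ay
        t = max(0.0, min(1.0, ((ox-ax)*dx + (oy-ay)*dy) / ((dx*dx + dy*dy) or 1.0)))
        return math.hypot(ax + t*dx - ox, ay + t*dy - oy) < radius

    def nearest_unobstructed(triangle, blues, reds, radius=10.0):
        """Closest blue square reachable in a straight line not blocked by any red square."""
        reachable = [b for b in blues
                     if not any(segment_blocked(triangle, b, r, radius) for r in reds)]
        return min(reachable,
                   key=lambda b: math.hypot(b[0]-triangle[0], b[1]-triangle[1]),
                   default=None)

    blues = [(100, 100), (200, 200), (300, 300)]
    reds = [(150, 150)]
    print(any_collinear_triple(blues), nearest_unobstructed((0, 0), blues, reds))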

In the suggested experiment, various control parameters could affect the relations between joystick movements and movements of the controlled object. I don't know how much difference that would make: e.g. using joystick displacement to control velocity vs acceleration.
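The contrast between those two mappings is easy to state precisely. The per-frame update rules below are one way of writing the manipulation down (dt is the frame interval); under acceleration control the triangle acquires momentum, so stopping it precisely over a target demands anticipatory braking rather than simply releasing the joystick.

    def velocity_control(pos, joystick, dt, gain=100.0):
        """Joystick displacement directly sets the triangle's velocity."""
        return (pos[0] + gain * joystick[0] * dt,
                pos[1] + gain * joystick[1] * dt)

    def acceleration_control(pos, vel, joystick, dt, gain=200.0):
        """Joystick displacement sets acceleration; velocity persists between
        frames, so the triangle has momentum."""
        vel = (vel[0] + gain * joystick[0] * dt,
               vel[1] + gain * joystick[1] * dt)
        return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt), vel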

A variant of the task might include some of the objects moving behind others and then reappearing. If some of them move in straight paths and some in curved paths, anticipating where and when an object will emerge in order to capture it will require different 'simulations' of what is happening while the object is hidden.

It may also be possible to provide acceleration and deceleration for some of the 'prey' objects.

I would expect people to be able to use information about curvature and acceleration/deceleration to say roughly when and where temporarily hidden objects will reappear; but if the blue objects remain visible while crossing the red objects, far more precise predictions will be made and people will be able to cope with faster-moving displays, because different perceptual and control subsystems will be able to use the continuous feedback.
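The 'simulations' referred to above could be as simple as extrapolating the last observed state of the hidden object under different motion models. The sketch below contrasts a straight-line (constant velocity) model with a constant-turn-rate model, stepping the hidden object forward until it would emerge from behind a rectangular occluder; it is only one way of implementing what the experiment requires, with all parameters invented for illustration.

    import math

    def extrapolate(pos, speed, heading, turn_rate, occluder, dt=0.02, max_t=5.0):
        """Step a hidden object forward under a constant-speed motion model.
        turn_rate = 0 gives a straight path; a non-zero value gives a circular arc.
        Returns (time, position) at which the object leaves the occluding
        rectangle (xmin, ymin, xmax, ymax), or None if it stays hidden."""
        x, y = pos
        t = 0.0
        while t < max_t:
            t += dt
            heading += turn_rate * dt
            x += speed * math.cos(heading) * dt
            y += speed * math.sin(heading) * dt
            inside = occluder[0] <= x <= occluder[2] and occluder[1] <= y <= occluder[3]
            if not inside:
                return round(t, 2), (round(x, 1), round(y, 1))
        return None

    # Same entry state behind the occluder, two different motion models:
    print(extrapolate((0.0, 5.0), 10.0, 0.0, 0.0, (0, 0, 20, 10)))   # straight path
    print(extrapolate((0.0, 5.0), 10.0, 0.0, 0.8, (0, 0, 20, 10)))   # curving path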

One condition (object invisible) uses mechanisms for predicting and reporting. The other uses mechanisms for control with visual servoing. These are different processes.

There may be individual differences in people's ability to do different tasks in parallel. I have observed that if a person washing up dishes is asked a question, there are two kinds of response: some people continue with the washing while answering and continuing the discussion, while others seem to freeze while answering.

On the other hand most people who can drive a car seem to be capable of simultaneously driving the car, which requires quite precise control, and also having a discussion, including talking about the scenery, a helicopter visible in the distance, the behaviour of the traffic, etc.

NOTE:
In the UK the Institute of Advanced Motorists has a special driving test that requires drivers to be able to comment on many things happening in front of them and things visible in the rear view mirror at the same time as driving. I don't know whether performance in this test has been the subject of any empirical research by psychologists.

I have noticed that when reversing my car into a narrow space, such as our garage, I simultaneously use visible cues about the alignment of parts of the car with other stationary objects, and also my knowledge of the locations of objects I cannot see but whose sizes I know roughly.

I use both kinds of information, but in different ways: the visible alignments and discrepancies are used for fine-grained control and the other information is used for more qualitative control -- 'I must be getting near the box so I should stop now'.

I don't know how easy it would be to set up experiments that deal with real-life situations.


Other mobile figures
A different sort of experiment could involve something like a rotating Necker cube, where subjects have to watch out for specific 3-D events, e.g. when one of the edges reaches its maximum distance from the viewer, and for 2-D events, e.g. when two of the lines intersect in the image.

There could also be tests of predictive abilities, e.g. the display disappears and the viewer has to press a button to indicate when an event would have occurred if the display had continued. This could be done for both 3-D events and 2-D events. The subject need not know in advance which sort of event will be probed. However, I am not sure how easy it will be to distinguish between the use of two representations and the use of one representation that has two functions.
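Here is a small sketch of how such a display and its two kinds of event could be generated: rotate a wire-frame cube, project it orthographically, and monitor (i) a 3-D quantity, the depth of a chosen vertex, whose maximum over a rotation cycle is the 3-D event, and (ii) a 2-D event, two projected edges that share no vertex crossing in the image. It is a schematic of the stimulus logic only, not of any experimental software that has actually been built.

    import math, itertools

    # The 8 vertices and 12 edges of a wire-frame cube
    VERTICES = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    EDGES = [(i, j) for i, j in itertools.combinations(range(8), 2)
             if sum(a != b for a, b in zip(VERTICES[i], VERTICES[j])) == 1]

    def rotate(p, ay, ax=math.radians(20)):
        """Rotate about the y axis by ay, then tilt about the x axis by ax."""
        x, y, z = p
        cy, sy = math.cos(ay), math.sin(ay)
        x, z = cy * x + sy * z, -sy * x + cy * z
        cx, sx = math.cos(ax), math.sin(ax)
        y, z = cx * y - sx * z, sx * y + cx * z
        return x, y, z

    def crosses(a, b, c, d):
        """2-D event: do the projected segments a-b and c-d properly intersect?"""
        side = lambda p, q, r: (q[0]-p[0])*(r[1]-p[1]) - (q[1]-p[1])*(r[0]-p[0])
        return side(a, b, c) * side(a, b, d) < 0 and side(c, d, a) * side(c, d, b) < 0

    for step in range(0, 360, 45):                # sample one full rotation
        pts3 = [rotate(v, math.radians(step)) for v in VERTICES]
        pts2 = [(x, y) for x, y, z in pts3]       # orthographic projection
        depth = pts3[0][2]                        # 3-D quantity for vertex 0
        crossing = any(crosses(pts2[i], pts2[j], pts2[k], pts2[l])
                       for (i, j), (k, l) in itertools.combinations(EDGES, 2)
                       if len({i, j, k, l}) == 4) # 2-D event: non-adjacent edges cross
        print(step, "deg: vertex-0 depth =", round(depth, 2), "edges crossing:", crossing)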

Maybe Johansson moving-light movies of people fighting, dancing, climbing up and down ladders, etc., would be more appropriate than a rotating rigid structure like a wire frame cube.

Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14, 201-211. See also this presentation (requires PowerPoint or OpenOffice).
A couple of demos are included here.
And an adjustable one here.

More thought is required to turn these suggestions into good experiments.

I think there are further subdivisions of cases that could be explored, e.g. using only topological as opposed to metrical information, or changing functional or causal relationships, e.g. supporting, pushing, grasping.


Related papers and presentations
Arnold Trehub, 1991, The Cognitive Brain,
MIT Press, Cambridge, MA,
Online PDF http://people.umass.edu/trehub/

Aaron Sloman http://www.cs.bham.ac.uk/research/projects/cosy/papers/orthogonal-competences.html
http://www.cs.bham.ac.uk/research/projects/cosy/papers/sensorimotor.html


Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham