
VISION AND ACTION
REQUIREMENTS FOR SEEING THE REAL WORLD
Aaron Sloman
Installed: 7 Mar 2009
Last updated: 2 Oct 2009; 5 Jun 2012; 4 Nov 2012
A recent presentation on these topics is
    http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#gibson
    What's vision for, and how does it work?
    From Marr (and earlier) to Gibson and Beyond

Recent papers related to the Meta-Morphogenesis project include discussions of visual
processes in mathematical reasoning, among other aspects of the evolution of visual
functions and mechanisms.
    http://tinyurl.com/CogMisc/meta-morphogenesis.html
    http://tinyurl.com/CogMisc/evolution-info-transitions.html
    http://tinyurl.com/BhamCog/triangle-theorem.html

I have been trying to work out what I would do if I had a team of
outstanding vision researchers with whom I could work for the next
few years (three to five years). What follows is a partial draft set
of answers, which will be updated from time to time.

People who specialise in vision research do not regard me as a
vision researcher, and there is some justification for that, insofar
as I spread myself very thinly over many topics, and I do not read
most of the published vision research reports. Nevertheless I have
been thinking about, reading about and writing about vision for over
30 years, including chapter 9 of
    The Computer Revolution in Philosophy
    I list some more papers, presentations, and discussions on vision
     at the end of this file.

This work has mostly been about requirements for human-like
or animal-like vision systems, rather than specific designs,
although the details of requirements do suggest constraints on
designs, and indicate some minimal architectural features, as shown
crudely here (2nd page).

Notice, however, that this gives merely one view of a complex
multi-level, multi-functional, dynamical system. A different view is
being developed within the CogAff project, based on the variety of
types of architecture that can be accommodated within the CogAff
schema (see the CogAff schema diagram).
Different ways of filling in the schema will put different
mechanisms in the boxes and different connections between the
mechanisms. The lowest layer is found only in the simplest
organisms. The middle layer evolved much later, under pressure to
represent and reason about what doesn't exist. The top layer
probably evolved in parallel with the other two (and makes use of
them). It is concerned with meta-semantic competences: abilities to
represent and reason about things that represent and reason, with
obvious implications for self-monitoring and self-control.

  The mechanisms do not all exist at birth in humans: they grow in carefully
   controlled phases using delayed development, for reasons explained in:
    Natural and artificial meta-configured altricial information-processing systems (2007).
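
For concreteness, here is a minimal sketch, in Python, of how the schema's grid of
layers and perception/action columns might be laid out as a data structure. The layer
and column names follow the description above; the example mechanisms and connections
are invented purely for illustration and are not taken from the CogAff papers.

    # Illustrative sketch only: one hypothetical way to lay out the CogAff
    # schema's grid of processing layers and perception/action columns.
    # The example mechanisms and connections below are invented.

    LAYERS = ["reactive", "deliberative", "meta-management"]   # oldest to newest
    COLUMNS = ["perception", "central", "action"]

    # A concrete architecture "fills in the schema" by placing mechanisms in
    # the boxes and wiring connections between them.
    architecture = {(layer, column): [] for layer in LAYERS for column in COLUMNS}
    architecture[("reactive", "perception")].append("optic-flow detector")
    architecture[("deliberative", "central")].append("planner over possible processes")
    architecture[("meta-management", "perception")].append("gaze and intention reader")

    connections = [
        (("reactive", "perception"), ("reactive", "action")),           # fast servo loop
        (("deliberative", "perception"), ("deliberative", "central")),  # scene description
        (("meta-management", "central"), ("deliberative", "central")),  # self-monitoring
    ]

    for box, mechanisms in architecture.items():
        if mechanisms:
            print(box, mechanisms)

Different ways of filling in the boxes and connections give different architectures,
which is the point of treating CogAff as a schema rather than a single design.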

The vision and action columns are also layered because evolution
discovered the need for perceptual and motor subsystems concerned
with acquiring and using information about the environment at levels
of abstraction corresponding to the different ontologies and
functions in the different layers. So waving to someone is an action
that requires meta-semantic competences and would be at least partly
under the control of the top, meta-management layer.

Likewise, seeing happiness or sadness in a face, or seeing an
intention in an action requires meta-semantic competences.

These meta-semantic perceptual, thinking, and action competences are
complex, but not necessarily more complex than abilities to perceive
and think about complex 3-D structures and processes in the
environment. E.g. ask yourself why, when a bolt passes through a
fixed nut, rotating the bolt about its axis makes it translate along
its axis. Some more examples are in this short discussion note
http://www.cs.bham.ac.uk/research/projects/cogaff/challenge.pdf
    "Perception of structure: Anyone Interested?"

Some disagreements with prevalent views

My work on vision has mainly been concerned with identifying
requirements for human-like or animal-like vision, including
requirements that will need to be met by visual systems in
intelligent robots that are currently far beyond the state of the
art in machine vision.

This work has led me to disagree with three widely held assumptions
regarding the functions of vision, (1)-(3) below, and one widely held
assumption about good means of achieving those functions, (4) below:

(1) I don't agree with the widely held assumption that the main function
    of a 3-D vision system in an animal or a robot is recognizing objects:
    recognition is a secondary function, which results from seeing.
    There are many situations in which we can see an object, and even do
    things like pick it up, jump on it, avoid touching it, break it
    apart, prod it, push it, etc., even though we do not recognise the
    whole object either as being an instance of a known category, or
    as being a previously encountered individual. So we need to make
    object-recognition occur as a by-product of seeing, not as the main
    or most basic function of seeing.

    It is also important to stress that perception is at least as much
    about processes as about objects. Biological visual systems
    did not evolve to cope with a series of snapshots.

    Animals exist in and interact with an environment in which many
    processes of different sorts occur, including processes in which
    objects change their properties (e.g. shape or colour), their spatial
    relationships and their causal relationships and interactions.
    Furthermore these changes may be metrical or qualitative,
    geometrical or topological, and may preserve or change complexity
    (e.g. as objects are combined to form more complex objects, or
    disassembled to form a larger collection of simpler objects).
    Perceiving these processes should not be confused with recognition.

    There are several issues concerning visual perception of 3-D objects
    that are not being addressed in part because of the excessive focus
    on recognition. One way to appreciate those problems is to consider
    how humans perceive objects they do not recognize. The proposal to
    study perception of polyflaps grew out of this requirement.

(2) I don't think 3-D vision should be thought of as producing
    some sort of internal model replicating or representing all the
    details of the scene, in such a way as to enable images of the scene
    to be generated by projection to different viewpoints. (This is one
    of the standard tests for success of a 3-D stereo system, but I
    think it is a misguided test).

    My brain cannot do that, yet I see a great deal of 3-D structure,
    and a great many processes in which 3-D structures are created or
    changed. That seems to be true of most people and animals with good
    vision. A small subset of individuals can learn to draw or paint
    what they see, but that is relatively rare.

    Examining things humans can do with pictures of impossible objects
    helps to undermine this 'isomorphic model-construction' view of 3-D
    vision. Some examples can be found here (PDF)

(3) Most vision researchers, in AI, psychology, etc. assume that
    vision is concerned with detecting what exists in the environment.
    This ignores the very important collection of issues first
    identified by J.J.Gibson which he described in terms of perception
    of affordances.
       J. J. Gibson, The Ecological Approach to Visual Perception,
       Houghton Mifflin, Boston, MA, 1979.

    Detailed examination of Gibson's examples and further investigation
    of the functions of vision indicate that a great deal of human vision
    is concerned not with what actually exists in the environment but
    with processes and objects that do not exist, but could exist,
    including both processes that could occur or be prevented as a
    result of actions of the perceiver (these involve affordances) and
    processes that could occur or be prevented by other things, e.g.
    something blowing in the wind, or being moved by gravity, or by
    another agent (I call these "proto-affordances").

    A paper investigating some of the logical and philosophical
     implications of this is online here:
        'Actual Possibilities', in Principles of Knowledge Representation
        and Reasoning: Proceedings of the Fifth International Conference (KR '96),
        Eds. L.C. Aiello and S.C. Shapiro, pp. 627--638, 1996.


(4) (Added 2 Oct 2009) Most vision researchers, in AI, psychology, etc. appear
    to assume that spatial locations, distances and angles are represented within a
    single global coordinate system, where

   (a) distances between items in the scene use a common metric so that everything
       can, for example, be expressed in cm., or multiples of some other fixed unit
       of length,

   (b) positions have coordinates relative to some common origin, where the
       coordinates make use of the common distance metric
   and
   (c) orientations in space have measurable angles relative to axes of that global
       coordinate system.

   I suspect that using a uniform, global system of metrics and a coordinate system
   based on Cartesian or polar co-ordinates is only something done by humans with a
   mathematical, scientific or engineering education, and cannot be done by young
   children or other animals.

   Instead, a young human child or animal develops a web of semi-metrical spatial
   relationships in each scene where lengths or distances are measured relative to
   other things in the scene, using partial orderings, e.g. X is longer than Y, X is
   longer than Z, the distance from P to Q is more than twice and less than three
   times the distance from R to S, etc. (This ability to estimate the number of times
   one length, or a difference in length, fits into another length is what I refer to
   as using a semi-metrical extension to a partial ordering.)

   The precise details of how this works, how the form of representation is learnt,
   and how the information thus expressed is used, are all topics for further
   research. (See the presentation on ontologies for baby robots below for more
   information, and the illustrative sketch that follows.)
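
As a purely illustrative sketch (not a claim about how such information is encoded
in brains), here is one way such a web of relationships could be represented
computationally: a partial ordering of lengths plus bounds on ratios, with no global
origin, axes, or common unit. All names and numbers are invented.

    # Illustrative sketch: a "web" of qualitative length comparisons with a
    # semi-metrical extension, instead of a global metric coordinate system.
    # The representation and inference rule are invented for illustration.

    # Partial ordering: pairs (a, b) meaning "a is longer than b".
    longer_than = {("X", "Y"), ("Y", "Z")}

    def is_longer(a, b, relation):
        """Transitive-closure query over the partial order."""
        frontier, seen = {a}, set()
        while frontier:
            x = frontier.pop()
            seen.add(x)
            for (p, q) in relation:
                if p == x:
                    if q == b:
                        return True
                    if q not in seen:
                        frontier.add(q)
        return False

    # Semi-metrical extension: bounds on how many times one length fits into
    # another, without committing to exact magnitudes or a common unit.
    ratio_bounds = {("PQ", "RS"): (2, 3)}   # dist(P,Q) is 2 to 3 times dist(R,S)

    print(is_longer("X", "Z", longer_than))   # True, by transitivity
    print(ratio_bounds[("PQ", "RS")])         # (2, 3)

Note that nothing in this structure requires a common origin or a fixed unit of
length: everything is relative to other items in the scene.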

What are the functions of vision?

Exactly what the functions of vision in animals are, and what the functions should be
in intelligent robots, is a hard unsolved research topic on which more work needs to
be done so that we have much richer sets of requirements against which to
evaluate proposed designs.

I have been working on collecting requirements for a long time, and trying to
organise them into different categories. But I think there is still a long way to go.

My paper for the Dagstuhl workshop on vision in February 2008 is one of several
attempts to get clear about this, and I still think I am missing important
requirements.

    http://www.cs.bham.ac.uk/research/projects/cosy/papers/#tr0801a
    Architectural and representational requirements for seeing
    processes, proto-affordances and affordances.

An earlier paper was presented at a vision workshop in 1986
    http://www.cs.bham.ac.uk/research/projects/cogaff/12.html#1207
    What are the purposes of vision.

In particular, I think there are three major functions of vision to be
distinguished, which are shared with other animals, and some additional
ones that are unique to humans.

Three major functions of vision

 1. Visual servoing -- online control of actions involving the production,
    prevention or alteration of 3-D processes of various kinds. This uses transient,
    constantly changing information. (A minimal illustrative servoing loop is
    sketched after this list of three functions.)

    This is sometimes mistakenly referred to as the "where" function of vision,
    assumed to be the role of the "dorsal" visual stream.

 2. Producing factual, descriptive, re-usable, information that endures for
    different time-scales, about processes and structures in the environment, with
    perception of processes as probably more important than perception of structures.

    This is often mistakenly referred to as the "what" function of vision, assumed to
    be the role of the "ventral" visual stream. Since the factual information can
    include location, orientation and spatial relationships, it can be as much
    "where" as "what" information. The alleged distinction ignores the facts.

    (Milner and Goodale later recommended switching from the what/where terminology
    to a perception/action distinction, which I think is also a mistake. Visual
    servoing includes both action and vision.)

 3. Producing information about what is not occurring, or does not exist but
    could occur or exist in the environment, and seeing constraints on
    such possibilities.

    This can be subdivided into seeing proto-affordances, seeing action-affordances,
    seeing epistemic-affordances, and limitations of epistemic affordances (e.g.
    seeing that information is not available, or that it is imprecise, etc.)

    In many cases, perceiving such affordances involves recognising what kind of
    stuff (material) things are made of -- e.g. rigid, flexible, elastic,
    impenetrable, fragile, squishy, heavy, hard, soft, liquid, powdery, etc. Many of
    these are not properties that can be directly sensed. They often need to be
    inferred from perceived results of actions (i.e. perceived processes).

    Examples of possible processes that are hard to see and easy to see (at least
    for adult humans) can be found here.
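
To make the contrast between function 1 and function 2 concrete, here is a minimal,
purely illustrative servoing loop. The sensing and actuation functions
(target_in_image, move_hand) are hypothetical stand-ins, not real APIs; the point is
that the transient image-space error is used immediately for control and then
discarded, rather than being turned into an enduring description of the scene.

    # Illustrative sketch of visual servoing (function 1): transient visual
    # error drives motion directly; nothing is stored for later re-use.
    # target_in_image() and move_hand() are hypothetical stand-ins for a
    # real vision pipeline and motor system.

    def target_in_image():
        """Hypothetical sensor: current image-plane offset (dx, dy) of the target."""
        return (0.4, -0.1)

    def move_hand(vx, vy):
        """Hypothetical actuator: command a small hand velocity."""
        print(f"velocity command: ({vx:.2f}, {vy:.2f})")

    GAIN = 0.5
    for _ in range(3):                 # a few iterations of the control loop
        dx, dy = target_in_image()     # transient, constantly changing information
        move_hand(-GAIN * dx, -GAIN * dy)
        # the error is used and discarded: no enduring scene description is built

Function 2, by contrast, would require constructing descriptions that persist and can
be re-used for other purposes after the controlling process has finished.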

Additional functions of vision that build on those

 4. Seeing causes and effects of things that happen or could happen.

    a. Seeing why something happens or happened involves reasoning
       about causes and finding explanations,
       e.g. seeing that something is being moved because something
       else is pushing it.

    b. This is related to but different from predicting what will
       happen, e.g. a moving object will hit an obstacle.

       It seems that such reasoning can use visual structures and
       visual mechanisms in some cases, and logical or other
       non-visual information in other cases.

       NB: these affordances are seen as directly related to perceived
       parts, features and relations, especially relations between
       surface fragments and to possible processes.

       So they should not be thought of as involving abstract
       inferences based on recognition of object categories, e.g.
       "That's a handle so it is graspable", "that's a door so it is
       openable", etc.

        Instead, seeing something as graspable involves seeing how two
        or more controllable surfaces can be moved so that the object
        comes to be between them, and if the two surfaces are then moved
        towards each other the object will be gripped, so that thereafter
        it will move together with the controllable surfaces.
        How all that might be expressed in the mind of a child, a
        robot, or a chimpanzee is an open research question. (A crude toy
        illustration is sketched after this list of functions.)

 5. Seeing other things in the environment as 'sentient' with abilities to have
    intentions, perform actions, and have responses to things happening in the
    environment.

    E.g. seeing in which direction someone is looking, seeing what someone is looking
    at, seeing what someone is doing, seeing what someone is trying to do, seeing
    that someone is failing to achieve a goal, etc. This includes something like
    adopting what Dennett calls "the intentional stance" or using what Newell called
    "the knowledge level". But it need not assume rationality, as they claim.

 6. Seeing and understanding communications.
    That can include reading written text, understanding gestures, reading music,
    reading mathematical notation or program code, reading maps, etc.
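
As promised under function 4 above, here is a crude toy illustration of the
surface-based view of graspability. The check, the names and the numbers are all
invented for illustration; nothing here is claimed about how a child, robot or
chimpanzee actually represents the possibility.

    # Toy illustration of the surface-based view of graspability: an object is
    # graspable between two controllable surfaces if they can be placed on
    # opposite sides of it and then closed onto it. Names and numbers invented.

    def graspable(object_width_mm, max_gripper_opening_mm, both_sides_reachable):
        """Crude check: the surfaces can straddle the object and then close on it."""
        return both_sides_reachable and object_width_mm < max_gripper_opening_mm

    print(graspable(40, 80, True))    # narrow object, both sides reachable: graspable
    print(graspable(120, 80, True))   # too wide for the gripper opening: not graspable
    print(graspable(40, 80, False))   # cannot get surfaces on both sides: not graspable

The point of the toy example is only that the judgement depends on relations between
surfaces and possible processes, not on first recognising an object category such as
"handle" or "door".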

NOTE ADDED 10 Mar 2009 (Revised 10 Jul 2009):
    PDF slides presented at a number of workshops and seminars
    recently elaborate on some of these points:

    http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#brown
    Ontologies for baby animals and robots
    From "baby stuff" to the world of adult science: Developmental AI
    from a Kantian viewpoint.


I don't expect any project to achieve all of those, or even to aim for all of them.
But I think it is important, when researching subsets of the functions of vision, to
pay attention to what the full range of functions is, so that work done on the
subsets can be informed by the requirement to be used later on as part of a more
general system. Otherwise there is the risk that work done on subsets will not
'scale out' to interface with other subsets, and will therefore have to be discarded
when more ambitious projects are attempted.

It may be desirable to develop a research project specifically to identify long term
requirements for visual systems that could be the basis of a partially ordered,
scenario-based roadmap for vision research (which will also necessarily involve
research on other functions that interact with vision systems). Some ways of
thinking about such roadmaps are indicated in the roadmap diagram taken from this
presentation:
    What's a Research Roadmap For? Why do we need one? How can we produce one?
    euCognition Research Roadmap meeting, January 2007.

If anyone is interested in collaborating on trying to assemble more complete
requirements for future vision systems, to provide the context for the work to be
done in the near future, then I would be very interested to hear suggestions,
including suggestions for collaboration. However, I do not intend to apply for
funding for research in this area. I shall go on doing it anyway, time-sharing with
other research activities.

Papers, presentations and discussion notes on vision

Papers (including book chapters)
Presentations on vision (PDF files)
Discussion notes on vision (HTML, plain text and PDF)
See also the vision sections of my Doings file.


Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham