
VISION AND ACTION
REQUIREMENTS FOR SEEING THE REAL WORLD
Aaron Sloman
Installed: 7 Mar 2009
Last updated: 2 Oct 2009; 5 Jun 2012; 4 Nov 2012
A recent presentation on these topics is
    http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#gibson
    What's vision for, and how does it work?
    From Marr (and earlier) to Gibson and Beyond

Recent papers related to the Meta-Morphogenesis project include discussions of visual
processes in mathematical reasoning, among other aspects of the evolution of visual
functions and mechanisms.
    http://tinyurl.com/CogMisc/meta-morphogenesis.html
    http://tinyurl.com/CogMisc/evolution-info-transitions.html
    http://tinyurl.com/BhamCog/triangle-theorem.html

I have been trying to work out what I would do if I had a team of
outstanding vision researchers with whom I could work for the next
few years (three to five years). What follows is a partial draft set
of answers, which will be updated from time to time.

People who specialise in vision research do not regard me as a
vision researcher, and there is some justification for that, insofar
as I spread myself very thinly over many topics, and I do not read
most of the published vision research reports. Nevertheless I have
been thinking about, reading about and writing about vision for over
30 years, including chapter 9 of
    The Computer Revolution in Philosophy
    I list some more papers, presentations, and discussions on vision
     at the end of this file.

This work has mostly been about requirements for human-like
or animal-like vision systems, rather than specific designs,
although the details of requirements do suggest constraints on
designs, and indicate some minimal architectural features, as shown
crudely here (2nd page).

Notice, however, that this gives merely one view of a complex
multi-level, multi-functional, dynamical system. A different view is
being developed within the CogAff project, based on the variety of
types of architecture that can be accommodated within the CogAff
schema (see the CogAff schema diagram).
Different ways of filling in the schema will put different
mechanisms in the boxes and different connections between the
mechanisms. The lowest layer is found only in the simplest
organisms. The middle layer evolved much later, under pressure to
represent and reason about what doesn't exist. The top layer
probably evolved in parallel with the other two (and makes use of
them). It is concerned with meta-semantic competences: abilities to
represent and reason about things that represent and reason, with
obvious implications for self-monitoring and self-control.

  The mechanisms do not all exist at birth in humans: they grow in carefully
   controlled phases using delayed development, for reasons explained in:
    Natural and artificial meta-configured altricial information-processing systems (2007).
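
For concreteness, here is a minimal sketch, in Python, of how the schema's grid of
layers and perception/action columns might be laid out as a data structure. The layer
and column names follow the description above; the example mechanisms and connections
are invented purely for illustration and are not taken from the CogAff papers.

    # Illustrative sketch only: one hypothetical way to lay out the CogAff
    # schema's grid of processing layers and perception/action columns.
    # The example mechanisms and connections below are invented.

    LAYERS = ["reactive", "deliberative", "meta-management"]   # oldest to newest
    COLUMNS = ["perception", "central", "action"]

    # A concrete architecture "fills in the schema" by placing mechanisms in
    # the boxes and wiring connections between them.
    architecture = {(layer, column): [] for layer in LAYERS for column in COLUMNS}
    architecture[("reactive", "perception")].append("optic-flow detector")
    architecture[("deliberative", "central")].append("planner over possible processes")
    architecture[("meta-management", "perception")].append("gaze and intention reader")

    connections = [
        (("reactive", "perception"), ("reactive", "action")),           # fast servo loop
        (("deliberative", "perception"), ("deliberative", "central")),  # scene description
        (("meta-management", "central"), ("deliberative", "central")),  # self-monitoring
    ]

    for box, mechanisms in architecture.items():
        if mechanisms:
            print(box, mechanisms)

Different ways of filling in the boxes and connections give different architectures,
which is the point of treating CogAff as a schema rather than a single design.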

The vision and action columns are also layered because evolution
discovered the need for perceptual and motor subsystems concerned
with acquiring and using information about the environment at levels
of abstraction corresponding to the different ontologies and
functions in the different layers. So waving to someone is an action
that requires meta-semantic competences and would be at least partly
under the control of the top, meta-management layer.

Likewise, seeing happiness or sadness in a face, or seeing an
intention in an action requires meta-semantic competences.

These meta-semantic perceptual, thinking, and action competences are
complex, but not necessarily more complex than abilities to perceive
and think about complex 3-D structures and processes in the
environment. E.g. ask yourself why, when a bolt passes through a
fixed nut, rotating the bolt about its axis makes it translate along
its axis. Some more examples are in this short discussion note
http://www.cs.bham.ac.uk/research/projects/cogaff/challenge.pdf
    "Perception of structure: Anyone Interested?"

Some disagreements with prevalent views

My work on vision has mainly been concerned with identifying
requirements for human-like or animal-like vision, including
requirements that will need to be met by visual systems in
intelligent robots that are currently far beyond the state of the
art in machine vision.

This work has led me to disagree with three widely held assumptions
regarding the functions of vision, (1)-(3) below, and one widely held
assumption about good means of achieving those functions, (4) below:

(1) I don't agree with the widely held assumption that the main function
    of a 3-D vision system in an animal or a robot is recognizing objects:
    recognition is a secondary function, which results from seeing.
    There are many situations in which we can see an object, and even do
    things like pick it up, jump on it, avoid touching it, break it
    apart, prod it, push it, etc., even though we do not recognise the
    whole object either as being an instance of a known category, or
    as being a previously encountered individual. So we need to make
    object-recognition occur as a by-product of seeing, not as the main
    or most basic function of seeing.

    It is also important to stress that perception is at least as much
    about processes as about objects. Biological visual systems
    did not evolve to cope with a series of snapshots.

    Animals exist in and interact with an environment in which many
    processes of different sorts occur, including processes in which
    objects change their properties (e.g. shape or colour), their spatial
    relationships and their causal relationships and interactions.
    Furthermore these changes may be metrical or qualitative,
    geometrical or topological, and may preserve or change complexity
    (e.g. as objects are combined to form more complex objects, or
    disassembled to form a larger collection of simpler objects).
    Perceiving these processes should not be confused with recognition.

    There are several issues concerning visual perception of 3-D objects
    that are not being addressed in part because of the excessive focus
    on recognition. One way to appreciate those problems is to consider
    how humans perceive objects they do not recognize. The proposal to
    study perception of polyflaps grew out of this requirement.

(2) I don't think 3-D vision should be thought of as producing
    some sort of internal model replicating or representing all the
    details of the scene, in such a way as to enable images of the scene
    to be generated by projection to different viewpoints. (This is one
    of the standard tests for success of a 3-D stereo system, but I
    think it is a misguided test).

    My brain cannot do that, yet I see a great deal of 3-D structure,
    and a great many processes in which 3-D structures are created or
    changed. That seems to be true of most people and animals with good
    vision. A small subset of individuals can learn to draw or paint
    what they see, but that is relatively rare.

    Examining things humans can do with pictures of impossible objects
    helps to undermine this 'isomorphic model-construction' view of 3-D
    vision. Some examples can be found here (PDF)

(3) Most vision researchers, in AI, psychology, etc. assume that
    vision is concerned with detecting what exists in the environment.
    This ignores the very important collection of issues first
    identified by J.J.Gibson which he described in terms of perception
    of affordances.
       J. J. Gibson, The Ecological Approach to Visual Perception,
       Houghton Mifflin, Boston, MA, 1979.

    Detailed examination of Gibson's examples and further investigation
    of the functions of vision indicate that a great deal of human vision
    is concerned not with what actually exists in the environment but
    with processes and objects that do not exist, but could exist,
    including both processes that could occur or be prevented as a
    result of actions of the perceiver (these involve affordances) and
    processes that could occur or be prevented by other things, e.g.
    something blowing in the wind, or being moved by gravity, or by
    another agent (I call these "proto-affordances").

    A paper investigating some of the logical and philosophical
     implications of this is online here:
        'Actual Possibilities', in Principles of Knowledge Representation
        and Reasoning: Proceedings of the Fifth International Conference (KR '96),
        Eds. L.C. Aiello and S.C. Shapiro, pp. 627--638, 1996.


(4) (Added 2 Oct 2009) Most vision researchers, in AI, psychology, etc. appear
    to assume that spatial locations, distances and angles are represented within a
    single global coordinate system, where

   (a) distances between items in the scene use a common metric so that everything
       can, for example, be expressed in cm., or multiples of some other fixed unit
       of length,

   (b) positions have coordinates relative to some common origin, where the
       coordinates make use of the common distance metric
   and
   (c) orientations in space have measurable angles relative to axes of that global
       coordinate system.

   I suspect that using a uniform, global system of metrics and a coordinate system
   based on Cartesian or polar co-ordinates is only something done by humans with a
   mathematical, scientific or engineering education, and cannot be done by young
   children or other animals.

   Instead, a young human child or animal develops a web of semi-metrical spatial
   relationships in each scene where lengths or distances are measured relative to
   other things in the scene, using partial orderings, e.g. X is longer than Y, X is
   longer than Z, the distance from P to Q is more than twice and less than three
   times the distance from R to S, etc. (This ability to estimate the number of times
   one length, or a difference in length, fits into another length is what I refer to
   as using a semi-metrical extension to a partial ordering.)

   The precise details of how this works, how the form of representation is learnt,
   and how the information thus expressed is used, are all topics for further
   research. (See the presentation on ontologies for baby robots below for more
   information, and the illustrative sketch that follows.)
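
As a purely illustrative sketch (not a claim about how such information is encoded
in brains), here is one way such a web of relationships could be represented
computationally: a partial ordering of lengths plus bounds on ratios, with no global
origin, axes, or common unit. All names and numbers are invented.

    # Illustrative sketch: a "web" of qualitative length comparisons with a
    # semi-metrical extension, instead of a global metric coordinate system.
    # The representation and inference rule are invented for illustration.

    # Partial ordering: pairs (a, b) meaning "a is longer than b".
    longer_than = {("X", "Y"), ("Y", "Z")}

    def is_longer(a, b, relation):
        """Transitive-closure query over the partial order."""
        frontier, seen = {a}, set()
        while frontier:
            x = frontier.pop()
            seen.add(x)
            for (p, q) in relation:
                if p == x:
                    if q == b:
                        return True
                    if q not in seen:
                        frontier.add(q)
        return False

    # Semi-metrical extension: bounds on how many times one length fits into
    # another, without committing to exact magnitudes or a common unit.
    ratio_bounds = {("PQ", "RS"): (2, 3)}   # dist(P,Q) is 2 to 3 times dist(R,S)

    print(is_longer("X", "Z", longer_than))   # True, by transitivity
    print(ratio_bounds[("PQ", "RS")])         # (2, 3)

Note that nothing in this structure requires a common origin or a fixed unit of
length: everything is relative to other items in the scene.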

What are the functions of vision?

Exactly what the functions of vision in animals are, and what the functions should be
in intelligent robots, is a hard unsolved research topic on which more work needs to
be done so that we have much richer sets of requirements against which to
evaluate proposed designs.

I have been working on collecting requirements for a long time, and trying to
organise them into different categories. But I think there is still a long way to go.

My paper for the Dagstuhl workshop on vision in February 2008 is one of several
attempts to get clear about this, and I still think I am missing important
requirements.

    http://www.cs.bham.ac.uk/research/projects/cosy/papers/#tr0801a
    Architectural and representational requirements for seeing
    processes, proto-affordances and affordances.

An earlier paper was presented at a vision workshop in 1986
    http://www.cs.bham.ac.uk/research/projects/cogaff/12.html#1207
    What are the purposes of vision.

In particular, I think there are three major functions of vision to be
distinguished, which are shared with other animals, and some additional
ones that are unique to humans.

Three major functions of vision

 1. Visual servoing -- online control of actions involving the production,
    prevention or alteration of 3-D processes of various kinds. This uses transient,
    constantly changing information. (A minimal illustrative servoing loop is
    sketched after this list of three functions.)

    This is sometimes mistakenly referred to as the "where" function of vision,
    assumed to be the role of the "dorsal" visual stream.

 2. Producing factual, descriptive, re-usable, information that endures for
    different time-scales, about processes and structures in the environment, with
    perception of processes as probably more important than perception of structures.

    This is often mistakenly referred to as the "what" function of vision, assumed to
    be the role of the "ventral" visual stream. Since the factual information can
    include location, orientation and spatial relationships, it can be as much
    "where" as "what" information. The alleged distinction ignores the facts.

    (Milner and Goodale later recommended switching from the what/where terminology
    to a perception/action distinction, which I think is also a mistake. Visual
    servoing includes both action and vision.)

 3. Producing information about what is not occurring, or does not exist but
    could occur or exist in the environment, and seeing constraints on
    such possibilities.

    This can be subdivided into seeing proto-affordances, seeing action-affordances,
    seeing epistemic-affordances, and limitations of epistemic affordances (e.g.
    seeing that information is not available, or that it is imprecise, etc.)

    In many cases, perceiving such affordances involves recognising what kind of
    stuff (material) things are made of -- e.g. rigid, flexible, elastic,
    impenetrable, fragile, squishy, heavy, hard, soft, liquid, powdery, etc. Many of
    these are not properties that can be directly sensed. They often need to be
    inferred from perceived results of actions (i.e. perceived processes).

    Examples of possible processes that are hard to see and easy to see (at least
    for adult humans) can be found here.
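
To make the contrast between function 1 and function 2 concrete, here is a minimal,
purely illustrative servoing loop. The sensing and actuation functions
(target_in_image, move_hand) are hypothetical stand-ins, not real APIs; the point is
that the transient image-space error is used immediately for control and then
discarded, rather than being turned into an enduring description of the scene.

    # Illustrative sketch of visual servoing (function 1): transient visual
    # error drives motion directly; nothing is stored for later re-use.
    # target_in_image() and move_hand() are hypothetical stand-ins for a
    # real vision pipeline and motor system.

    def target_in_image():
        """Hypothetical sensor: current image-plane offset (dx, dy) of the target."""
        return (0.4, -0.1)

    def move_hand(vx, vy):
        """Hypothetical actuator: command a small hand velocity."""
        print(f"velocity command: ({vx:.2f}, {vy:.2f})")

    GAIN = 0.5
    for _ in range(3):                 # a few iterations of the control loop
        dx, dy = target_in_image()     # transient, constantly changing information
        move_hand(-GAIN * dx, -GAIN * dy)
        # the error is used and discarded: no enduring scene description is built

Function 2, by contrast, would require constructing descriptions that persist and can
be re-used for other purposes after the controlling process has finished.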

Additional functions of vision that build on those

 4. Seeing causes and effects of things that happen or could happen.

    a. Seeing why something happens or happened involves reasoning
       about causes and finding explanations,
       e.g. seeing that something is being moved because something
       else is pushing it.

    b. This is related to but different from predicting what will
       happen, e.g. a moving object will hit an obstacle.

       It seems that such reasoning can use visual structures and
       visual mechanisms in some cases, and logical or other
       non-visual information in other cases.

       NB: these affordances are seen as directly related to perceived
       parts, features and relations, especially relations between
       surface fragments and to possible processes.

       So they should not be thought of as involving abstract
       inferences based on recognition of object categories, e.g.
       "That's a handle so it is graspable", "that's a door so it is
       openable", etc.

        Instead, seeing something as graspable involves seeing how two
        or more controllable surfaces can be moved so that the object
        comes to be between them, and if the two surfaces are then moved
        towards each other the object will be gripped, so that thereafter
        it will move together with the controllable surfaces.
        How all that might be expressed in the mind of a child, a
        robot, or a chimpanzee is an open research question. (A crude toy
        illustration is sketched after this list of functions.)

 5. Seeing other things in the environment as 'sentient' with abilities to have
    intentions, perform actions, and have responses to things happening in the
    environment.

    E.g. seeing in which direction someone is looking, seeing what someone is looking
    at, seeing what someone is doing, seeing what someone is trying to do, seeing
    that someone is failing to achieve a goal, etc. This includes something like
    adopting what Dennett calls "the intentional stance" or using what Newell called
    "the knowledge level". But it need not assume rationality, as they claim.

 6. Seeing and understanding communications.
    That can include reading written text, understanding gestures, reading music,
    reading mathematical notation or program code, reading maps, etc.
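
As promised under function 4 above, here is a crude toy illustration of the
surface-based view of graspability. The check, the names and the numbers are all
invented for illustration; nothing here is claimed about how a child, robot or
chimpanzee actually represents the possibility.

    # Toy illustration of the surface-based view of graspability: an object is
    # graspable between two controllable surfaces if they can be placed on
    # opposite sides of it and then closed onto it. Names and numbers invented.

    def graspable(object_width_mm, max_gripper_opening_mm, both_sides_reachable):
        """Crude check: the surfaces can straddle the object and then close on it."""
        return both_sides_reachable and object_width_mm < max_gripper_opening_mm

    print(graspable(40, 80, True))    # narrow object, both sides reachable: graspable
    print(graspable(120, 80, True))   # too wide for the gripper opening: not graspable
    print(graspable(40, 80, False))   # cannot get surfaces on both sides: not graspable

The point of the toy example is only that the judgement depends on relations between
surfaces and possible processes, not on first recognising an object category such as
"handle" or "door".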

NOTE ADDED 10 Mar 2009 (Revised 10 Jul 2009):
    PDF slides presented at a number of workshops and seminars
    recently elaborate on some of these points:

    http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#brown
    Ontologies for baby animals and robots
    From "baby stuff" to the world of adult science: Developmental AI
    from a Kantian viewpoint.


I don't expect any project to achieve all of those, or even to aim for all of them.
But I think it is important, when researching subsets of the functions of vision, to
pay attention to what the full range of functions is, so that work done on the
subsets can be informed by the requirement to be used later on as part of a more
general system. Otherwise there is the risk that work done on subsets will not
'scale out' to interface with other subsets, and will therefore have to be discarded
when more ambitious projects are attempted.

It may be desirable to develop a research project specifically to identify long term
requirements for visual systems that could be the basis of a partially ordered,
scenario-based roadmap for vision research (which will also necessarily involve
research on other functions that interact with vision systems). Some ways of
thinking about such roadmaps are indicated in the roadmap diagram taken from this
presentation:
    What's a Research Roadmap For? Why do we need one? How can we produce one?
    euCognition Research Roadmap meeting, January 2007.

If anyone is interested in collaborating on trying to assemble more complete
requirements for future vision systems, to provide the context for the work to be
done in the near future, then I would be very interested to hear suggestions,
including suggestions for collaboration. However, I do not intend to apply for
funding for research in this area. I shall go on doing it anyway, time-sharing with
other research activities.

Papers, presentations and discussion notes on vision

Papers (including book chapters)
Presentations on vision (PDF files)
Discussion notes on vision (HTML, plain text and PDF)
See also the vision sections of my Doings file.


Maintained by Aaron Sloman
School of Computer Science
The University of Birmingham