Aaron Sloman
School of Computer Science, University of Birmingham, UK
An extension to
The Meta-Morphogenesis Project
and
The Cognition and Affect Project
NOTE This is an incomplete DRAFT: Work in progress
This file is
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/vision
(Extending some of the work done
during the CoSy project
and earlier work attempting to understand the biological functions of vision, e.g.
Image interpretation: The way
ahead? (1982))
What are the purposes of vision?
(1986)
Discussion Paper: Predicting Affordance Changes (2007)
And others...
INSTALLED:
09 Oct 2013
UPDATED:
18 Oct 2015 (Added garden videos, and made some formatting changes).
5 Feb 2014; 26 Mar 2014; 30 Mar 2014; 13 Apr 2014; 30 Jun 2014; 6 Jul 2014
1 Nov 2013; 2 Nov 2013; 8 Nov 2013; 9 Dec 2013
11 Oct 2013; 22 Oct 2013 (relocated and reorganised);
Also look at the raw digitised data for clues. (Or distractions??)
Where possible, also look through narrow tubes (of various widths), moving each tube to point at different parts of the scene, to investigate the roles of context, saccades, etc.
The widely used slogan "Perception is controlled hallucination" is often attributed to him, though I have not found it in his publications. It overlaps with themes in these papers of his
M.B. Clowes, 'On seeing things',
in Artificial Intelligence, 2, 1, 1971, pp. 79--116,
http://dx.doi.org/10.1016/0004-3702(71)90005-1
M.B. Clowes,
Man the creative machine: A perspective from Artificial Intelligence research,
in J. Benthall (Ed.),
The Limits of Human Nature, Allen Lane, London, 1973.
Extracts from these papers are discussed in an overview of his contributions to
AI in this tribute:
http://www.cs.bham.ac.uk/research/projects/cogaff/sloman-clowestribute.html#bio
-----
and discussed the problems of perceiving the 3-D structures depicted in the image. E.g. which hands belong to whom? It is not possible to answer that without using prior knowledge of human anatomy.
Compare the notion of "Unconscious inference" proposed by von Helmholtz in 1867.
This document presents my elaboration of some of his ideas (ideas also shared by some other vision researchers, though by no means all).
Some of the ideas are related to the work on "Shape from Shading" by Horn (1970) and the 1978 paper on "intrinsic scene characteristics" by Barrow and (Jay) Tenenbaum. There is a lot more relevant work, including work on use of optical flow; use of highlights, reflections, and static and moving shadows, especially on curved surfaces; use of texture; and, in some cases, coordination of visual and haptic information as a finger or palm moves over a visible surface in contact with it. The visible behaviour of materials impinging on a surface can also provide information about the surface, such as water flowing down it, or a trickle of water or sand bouncing off it, and no doubt there are many more examples. Despite the vast amount of work that has already been done on investigating and modelling or replicating aspects of visual competences in humans and other animals, it may take a lot more work to unravel the achievements of billions of years of evolution.
Most of all, we need a better understanding of the functions of animal vision. Vision researchers tend to start with the assumption that the functions of vision are obvious, and the main problem is how to design mechanisms able to serve those functions. Berthold Horn (Horn 1980) wrote: "Certain kinds of operations on images appear to be dictated by the image rather than the task and ought to be done without consideration for the task. At this point, however, it seems that task-dependent representations will be with us for a while."
Perhaps some of the complexity and counter-intuitive structure of animal retinal mechanisms, and the variety of "downward" neural connections (including control of pupil size, saccades, etc.), should be recognized as indications that there are classes of tasks, encountered by our evolutionary ancestors and arising out of features of the sorts of environment that animals have had to deal with on this planet, that make it useful for biological vision systems to have resources that address those sorts of tasks. Without such resources, vision as we know it might be impossible, though a different sort of image processing, based on knowledge-free analysis of image data, might be useful for some engineering applications.
Most vision researchers ignore important functions of human vision, including, for example, the ability of humans to discover regularities in Euclidean geometry and prove some of them, as reported in Euclid's Elements about two and a half millennia ago, perhaps building on much older achievements of biological evolution shared with other intelligent animals, such as abilities to perceive possibilities for change (Sloman 1996) and constraints on possibilities for change -- together labelled 'proto-affordances' in Sloman (2008) and elsewhere. Some examples are provided in:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/triangle-theorem.html
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/triangle-sum.html
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/torus.html
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/knots/
(All the above are reports on work in progress -- some also mentioned below.)
Another feature of Clowes' work, to which I've drawn attention in a new appendix with some notes on his biography and publications in this tribute http://www.cs.bham.ac.uk/research/projects/cogaff/sloman-clowestribute.html#bio is the apparent use, by human visual systems, of an ontology covering image features, scene features and unobservable aspects of the environment, an ontology not necessarily closely related to mathematically definable features of a rectangular grid of numerical image measures. What biological evolution did was to produce designs that worked in rich and varied, but highly constrained, physical environments for organisms with particular modes of interaction, including interaction with other active entities, rather than solving general problems of data-mining in a sea of data. I believe this has deep implications regarding missing aspects of current research in vision. This document is part of a long struggle to identify some of those missing aspects. This is related to some of the points made in Abercrombie (1960).
Some of the trade-offs between totally general learning mechanisms and mechanisms that are products of many "design decisions taken by evolution" in increasingly complex and varied environments are discussed in this incomplete document:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/simplicity-ontology.html
Simplicity and Ontologies:
The trade-off between simplicity of theories and sophistication of ontologies
From the Editor's Desk
Unsolved Problems...
by Arjan Kuijper
"...the German section of the IAPR hosted a workshop on Unsolved Problems in Pattern Recognition and Computer Vision."
I have not yet studied the report closely enough to find out how far the attempts to ...
"Somewhere, in a glass building several miles outside of San Francisco, a computer is imagining what a cow looks like.Similar claims, or ambitions, have recently been announced by several projects, including some very high profile publicly-funded projects.Its software is visualizing cows of varying sizes and poses, then drawing crude digital renderings, not from a collection of photographs, but rather from the software's "imagination."
The company is weaving together bits of code inspired by the human brain, aiming to create a machine that can think like humans.
.....
The idea of creating smarter computers based on the brain has been around for decades as scientists have debated the best path to artificial intelligence. The approach has seen a resurgence in recent years thanks to far superior computing processors and advances in computer-learning methodologies. One of the most popular technologies in this area involves software that can train itself to classify objects as varied as animals, syllables and inanimate objects."
However, all the researchers I know of start with digitised images in rectangular arrays.
As far as I know, there is no evidence that any animal brain CAN perform operations on, or NEEDS to perform operations on, rectangular arrays of digitised image data.
I suspect we need to think about modelling visual brain functions by starting not from what computer technology can do, nor from what neuroscientists believe brains do, but by asking what the information content of the optical information available in the environment is, how it can be used by animals with varying needs and capabilities, and which optical and other sensors and processing devices can meet those requirements, in a world much of which is constantly in motion, viewed by animals that are themselves in motion much of the time.
[James Gibson had some good ideas about this in 1966. He also had some bad ideas.]
Of course, I don't deny that all sorts of useful new applications can come from research that neither replicates nor explains what brains do. It has been happening for decades in AI, and earlier, e.g. in mechanical calculators that haven't a clue what a number is but do arithmetic much faster and more accurately than any human!
What differences can you see between the crane in the picture in the Meccano manual and the crane built (mainly) from plastic Meccano pieces?
How does this task differ from the tasks given to machines that are trained using thousands or millions of labelled images and then required to attach labels to new images?
What do you have to do in order to answer the question about differences between the
pictures? You may prefer to use bigger versions of the pictures here:
If you were trying to design a visual system for a mobile animal in a natural environment, would you prefer to start from video cameras that produce sequences of rectangular arrays of images, or something different?
What did evolution do about this? Does anyone know?
The aim of this document is to draw attention to some of the less obvious functions of vision and the requirements that follow for well designed mechanisms of vision. I'll present some conjectures regarding functional requirements that are often not noticed, or perhaps deliberately ignored, by vision researchers, but seem to be important in human and animal vision. They are not the only important functions, but I'll present evidence that ignoring them accounts for some of the serious limitations of current machine vision (to be documented in more detail later).
Conjecture 1 (image structure):
Some of the serious limitations of current AI/Robotic vision systems
result from failure to find all the low level image structures that are
relevant to possible perceived scenes (Barrow and Tenenbaum (1978)).
Removing the limitations may require development of new kinds of feature detector for "low level" image features, new forms of representation for such features, and new ways of assembling the resulting features into descriptors on different scales, and in different "domains": image-fragment domains and scene-fragment domains, including both static scene fragments and process fragments. Some examples of "everyday" process fragments are presented in the attached collection of videos of an ordinary garden.
Conjecture 2 (scene structure):
Other limitations come from the assumption that information about
scene structure should, wherever possible, make use of metrical
information, e.g. distances to surfaces, orientations, curvatures,
lengths, areas, volumes, etc. or probability distributions over
such values in cases where the information derivable is unreliable.
Conjecture 3 (functions of vision):
The above limitations arise in part because researchers make over-
simplified assumptions about the functions of vision. Biological vision has
many different uses including detection of what J.J.Gibson called "affordances",
a notion that is substantially extended in Sloman (Talk 93).
There are also differences between requirements for control information
required for transitory online control functions (visual servoing) and
acquisition of more descriptive information that can be used for multiple
purposes immediately or at some future time.
There are additional requirements for perception of causal and functional
relationships (e.g. support, prevention), perception of intentions and
intentional actions of other intelligent agents, detection of emotional and
other states of other agents, and abilities to work out what others can and
cannot see, which can be important for predators, prey, teachers, parents, etc.
A yet more sophisticated use of human vision is understanding proofs, whether
diagrammatic or logical, as illustrated in Chapter 7 of The Computer Revolution
in Philosophy
http://www.cs.bham.ac.uk/research/projects/cogaff/crp/chap7.html
Also discussed in
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/mathsem.html
(From molecules to mathematicians.)
and in this discussion of some aspects of reasoning in Euclidean geometry:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/triangle-theorem.html
Hidden Depths of Triangle Qualia
(We still lack a good characterisation of the functions of a "mathematical eye".)
Notes on Conjecture 1:
Conjecture 1 will be illustrated by presenting examples of images in which a great
deal of structure is visible to (adult) humans, which, as far as I know, cannot be
detected, or even represented, by most or possibly all current artificial visual
systems, and which I suspect current visual learning systems could not learn to
detect, because of assumptions built into the learning mechanisms (e.g. assumptions
about what sorts of things need to be learnt, and how they should be represented).
I'll suggest some search tasks that could indicate presence of these competences,
e.g. the task of finding where a particular scene-fragment visible in one image is
visible in another image, and being able to explain changes of image contents for the
same scene fragment, e.g. by changed viewpoint, rotation of the object, change in
reflections, change in lighting.
The images include features that I suspect are not yet used in machine vision
systems, and which seem to be required for some visual competences in humans (and
possibly other animals), e.g. answering certain questions about the scenes, such as: Why
does this part of the object look different in these two images? Does this
fragment of image A correspond to any visible fragment in image B, where the
fragments are transformed by a change of viewpoint, illumination, or viewing
distance?
I'll start with a group of four quite difficult images that present the challenges, and then give a larger collection of images with similar challenges, some easier than others.
I am not claiming that any of these tasks are impossible for machines, only that
current assumptions about how to set goals for vision systems (especially the use of
benchmark image sets and tests) and assumptions about how to achieve the goals both
hinder the advance of machine vision, and our understanding of animal vision.
Please let me know if I am wrong -- my information could be out of date.
Notes on Conjecture 2
Conjecture 2 is partly addressed by suggesting alternatives to the use of metrical
information, e.g. use of partial orderings, e.g. nearer, further, more or less
curved, sloping more or less steeply, larger, smaller, faster, slower, containing,
overlapping more or less, changing colour, changing colour more or less quickly
across a region, or in a location at a time, etc. etc. The alternatives include use
of structural descriptions with something like a grammar. That is not a new
idea: work on image description and scene description languages began in the 1960s,
as illustrated in Kaneff (1970), Evans (1968), later work on learning structural descriptions
by Winston, and others.
A recent use of this example is in Shet et al. (2011), though they use what appear to be relatively conventional low level feature detectors and then use a grammar to specify ways in which discovered features can be combined to provide evidence for larger structures, whereas I am proposing that something like a grammar may also be required for important low level features.
Another difference is that I am suggesting giving the system the ability to create new descriptors "on the fly" when trying to find correspondences, e.g. between stereo pairs, or across temporal sequences, or across spatial gaps. E.g. which thing in the direction of the sun from a shadow could be producing the shadow? What object could have made these marks in the mud? Those temporary, ad hoc, descriptors are often discarded after use, unless a learning mechanism detects that some of them are used often in useful ways. That would require a special purpose intermediate memory for interesting things needed now that might be worth remembering in the long term.
It seems likely that brains need many different kinds of memory dealing with different time-scales, spatial scales, functions, contexts, etc.
Much of the information perceived, used, and in some cases stored, is not metrical information. For example, you don't need to know the width of A or the width of B in order to know that A is wider than B. If A is an opening in a wall and you are B and you have the goal of getting to the other side of the wall, then being able to see which is wider may suffice for a decision whether to try to get through the opening.
In some cases the partial ordering information does not suffice: an estimate of the difference may be needed, e.g. because small differences in size between mover and gap are more likely to lead to problems than large differences.
But it may not be necessary to have an exact measure of the amount by which A is wider than B. It may suffice to be able to tell that the difference between A and B is greater than C: e.g. the difference between the gap width and your width is greater than the combined width of your arms. In that case there will be more than enough clearance for motion through A. This is another use of partial orderings.
An important task for young learners may include acquiring the ability to detect situations in which there is enough clearance, and situations in which there is not enough clearance. In situations where it is not possible to determine whether there is or is not enough (hence the need for partial orderings) there may be strategies for checking by performing some action, and strategies for deciding whether a risk of contact is or is not worth taking (e.g. depending on what you are carrying, and how fragile, or how valuable, it is). That's a very different kind of learning from learning to segment images and classify objects perceived in the segments. Compare Gibson (1979).
The ability to treat differences as themselves partially ordered provides the basis for using partial orderings to add arbitrary precision to spatio/temporal information, limited only by what the available data actually supports, which can vary for different comparisons in the same scene.
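A minimal Python sketch may make this concrete. It is my own illustration, not part of the original argument: clearance decisions use only interval bounds and comparisons between differences, never exact measures. The names Interval, definitely_wider and enough_clearance, and the numbers, are invented for this example.

from collections import namedtuple

Interval = namedtuple("Interval", ["low", "high"])   # bounds on an unknown width

def definitely_wider(a, b):
    # True/False when the ordering is certain, None when the data leave it open
    if a.low > b.high:
        return True
    if a.high < b.low:
        return False
    return None

def enough_clearance(gap, body, margin):
    # Is the gap wider than the body by more than the margin?
    # Uses only comparisons between differences, never exact sizes.
    if gap.low - body.high > margin.high:
        return True        # certainly enough room
    if gap.high - body.low < margin.low:
        return False       # certainly not enough room
    return None            # undecided: probe, or act cautiously

# A doorway seen as roughly 80-95cm, a body 60-65cm wide, margin 10-12cm:
print(definitely_wider(Interval(80, 95), Interval(60, 65)))                      # True
print(enough_clearance(Interval(80, 95), Interval(60, 65), Interval(10, 12)))    # True

The point of the sketch is only that the "undecided" outcome is a first-class answer, triggering probing or caution, rather than being forced into a spurious precise estimate.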
Berthold Horn's ideas about shape from shading (PhD thesis, 1970) made a pioneering contribution over four decades ago (including influencing Barrow and Tenenbaum), though I suspect most of the potential of that work has still not been realised. There have also been attempts to get structural information from motion (e.g. pioneered by Shimon Ullman[REF]), though I believe the aim has generally been to derive precise locations and movements of visible fragments of rigid objects (such as buildings, statues, lamp-posts, etc.) by triangulation, and a common test of success is the ability to display views of the original scene from new viewpoints, including 'fly through' demonstrations, as in this demonstration by Changchang Wu http://ccwu.me/vsfm/
Projecting perceived scenes from arbitrary viewpoints is something most brains cannot do, though certain forms of art training help. Moreover, the ability to do it does not necessarily meet many of the other requirements of vision in an animal or robot, e.g. the ability to understand potential for change (affordances) and to explain observed facts, e.g. explaining how a mouse escaped from a cage by finding a hole in the cage bigger than the mouse.
Focusing on the wrong requirements (such as acquiring enough information to generate a 3-D video) may distract attention from far more important requirements for vision in an intelligent agent, including requirements for various sorts of information that can be used for varieties of planning, learning, reasoning, control of actions, or understanding other intelligent agents.
Notes on Conjecture 3
[To be added. Meanwhile see Sloman (Talk 93),
Sloman (1982)]
From this viewpoint, the pupil, lens, retina, Area V1 (primary visual cortex) and the optic nerve and other nerves connecting them have the function of a device for sampling the optic cone. During saccades different portions of the cone are sampled, in different viewing directions. This implies that visual information across saccades must be stored somewhere else. It also allows for important visual features to be based on relationships between information gained at beginnings and ends of saccades. See (Trehub 1991) chapters 3 and 4.
Head motion or whole body motion adds further complications: instead of a fixed optic cone having different portions sampled at different times as a result of saccades, a moving eye (on a moving head or whole body) is constantly generating new optic cones partly overlapping with previous ones. This can happen in different ways depending on the type of motion of the observer, e.g. along the line of sight, at right angles to the line of sight, or somewhere in-between, and the motion may be continuous or composed of discontinuous segments (including blink-induced discontinuities). Integrating all the information available across all these transitions is a formidable challenge, especially when the problem is wrongly characterised as building a 3-D model of all the visible surfaces that can be used to generate simulated changing views, which my brain cannot do, and which in any case fails to achieve most of the functions of vision.
Useful partial orderings may be found (1) within fragments of one image, (2) between fragments of two or more images, (3) within scene fragments at a time, (4) between scene fragments at different times, and (5) between scene fragments and image fragments, and can also be constructed in more complex ways where multiple image and scene fragments are perceived.
There are also partial orderings between partial orderings. E.g. over a time interval the distance between two image fragments may decrease, and at a later time the distance between the two fragments may again decrease. But if the second decrease is greater than the first, that may be evidence of something accelerating in the scene, or the viewer moving away from the scene, or motion taking place on a curved surface (e.g. a rotating sphere), or other possibilities where reflections, highlights and shadows are involved.
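As an illustration only (the function names and data below are invented), here is a short Python sketch of how such "orderings of orderings" might be computed from a sequence of rough separation estimates, reporting only the signs of changes and of changes of changes:

def sign(x, tolerance=0.0):
    # qualitative value: +1 increasing, -1 decreasing, 0 no noticeable change
    if x > tolerance:
        return 1
    if x < -tolerance:
        return -1
    return 0

def first_order(separations, tolerance=0.0):
    # is each separation larger, smaller or about the same as the previous one?
    return [sign(b - a, tolerance) for a, b in zip(separations, separations[1:])]

def second_order(separations, tolerance=0.0):
    # are the changes themselves growing? (possible evidence of acceleration,
    # curvature, or viewer motion -- the hypotheses mentioned in the text)
    diffs = [b - a for a, b in zip(separations, separations[1:])]
    return [sign(abs(d2) - abs(d1), tolerance) for d1, d2 in zip(diffs, diffs[1:])]

separations = [10.0, 9.0, 7.5, 5.5]     # separation of two tracked fragments over time
print(first_order(separations))         # [-1, -1, -1]: steadily decreasing
print(second_order(separations))        # [1, 1]: each decrease larger than the last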
The previous paragraphs present some hypotheses regarding functions of vision and some of the information that visual perceivers may be able to use to achieve those functions. The ideas are very sketchy, and it is possible that some artificial systems I have not encountered or have not understood, already meet these requirements. I shall now present some images that might be used as examples of test cases for such systems.
The first four images below are of the same scene with changes of viewpoint, zoom level, and amount of scene shown. The challenge is to identify the scene fragment common to all four images, and explain the differences in appearance.
An additional set of images with similar challenges can be found in images here.
Figure A
Figure B
Now consider the next two pictures.
How are they related to each other and to the previous pictures?
Figure D
Being able to think about relationships between two scenes is not restricted to making comparisons between familiar structures or processes recognized in the scenes. It is also possible to see a novel structure in one scene composed of previously known elements in a particular configuration and then seek a corresponding structure in the other scene. Its appearance (or image projection) in different images may be very different, e.g. because lighting has changed, the viewpoint has changed, objects have been rotated, and the object may be flexible, and have altered its shape between the images.
Of course, seeing a complex structure typically requires perception of previously encountered image and scene fragments, and use of familiar types of spatial relationship, but the things seen and compared need not all have structures that have previously been encountered and memorised. Some people will recognise fairly large structures in the scenes depicted above, and others will not. But both groups should be able (possibly with a little difficulty) to identify physical structures that are common across the two scenes despite differences in the corresponding image structures, and despite the fact that the structural information is held only in temporary storage, since the objects have not been encountered previously.
Comment A: Zooming
There's something in the perceived structure that is not disrupted by zooming in or out,
but would be if the zooming went too far: scale invariant matching mechanisms are
important here, but they have limitations.
Are there AI vision systems that can detect the structural correspondences between the last two images, and can also detect which portions of the first two images correspond to the items shown in the last two images?
The task seems to be easier (for humans) if all four images are displayed on the screen simultaneously. Why is that? Something to do with visual workspaces?
Try this web page to see all of the above images at once.
That page presents the following questions about the four images:
Compare the work on reasoning about pictorial analogies presented in Evans (1968), and Winston's work on learning structural descriptions from examples. Both depend on the ability to create a structural description of an object the first time it is perceived, i.e. without repeated training on that object or its depictions.
Note:
I am not claiming that all humans succeed at the same visual tasks: there are
individual, developmental, and cultural differences. So my demonstrations may fail
for some people for whom other demonstrations work. Biology is not like physics: many
examples are unique. But that does not imply that they lack explanations based on
general principles about the underlying mechanisms combined with particular facts.
Comment B (SIFT et al.):
One of the best known commonly used techniques for identifying patches of object
surface when viewed in different images is known as SIFT (Scale Invariant Feature
Transform), summarised in
http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
I am not an expert in this area, but when I explored the use of SIFT while participating
in the CoSy robotic project several years ago (in 2005) I found that it was unsuccessful
at dealing with transparent objects, e.g. a plastic water bottle, with or without
water, and objects with shiny/reflective curved surfaces, for which image structures
corresponding to surface patches change dramatically with viewpoint, as in the
pictures above. A different problem was that the technique did not provide the kinds
of scene fragment descriptions that seemed to be needed for a mobile robot acting in a
complex environment, including information relevant to thinking about possibilities
for change (various types of affordance, discussed below). I do not know whether I
misjudged the technique at the time, or whether recent advances in SIFT-based
algorithms have overcome the problems. An example may be the work of Ernst and
colleagues referenced below, which they claim uses "a continuous domain
model, the profile trace, which is a function only of the topological properties of
an image and is by construction invariant to any homeomorphic transformation of the
domain." The web site demonstrates use of the method to track the bridge of a
human nose through video sequences. I don't know how general the method is.
For a brief introduction to a wider range of potentially generally useful 2-D image
features see this overview, including detection of corners, blobs, and ridges:
http://en.wikipedia.org/wiki/Interest_point_detection
NOTE added 8 Dec 2013
More recent work by Hinton and colleagues based on "deep learning" out-performs
rivals in standard pattern recognition competitions, but I suspect those techniques
would not be sufficient for the tasks for vision systems described here. See
http://books.nips.cc/papers/files/nips25/NIPS2012_0534.pdf
The paper compares results of different machine learning systems on collections of images, but gives no indication of how they compare with humans. From the examples presented, and the performance figures, it appears that there is still a large gap between human and machine perception. The tests presented here, based on requirements for animal/human vision, seem to require different capabilities. However, I have not yet understood the mathematical details in that work, or this paper:
Krizhevsky, Alex, Sutskever, Ilya and Hinton, Geoffrey. "ImageNet classification with deep convolutional neural networks", in Advances in Neural Information Processing Systems 25 (NIPS 2012), pp. 1106--1114.
http://www.cs.toronto.edu/~gdahl/papers/momentumNesterovDeepLearning.pdf
Sutskever, Ilya and Martens, James and Dahl, George and Hinton, Geoffrey.
"On the importance of initialization and momentum in deep learning". 2013. In
Proceedings of the 30th International Conference on Machine Learning (ICML).
Use of partial ordering information can also be useful for sensor data in which scalar information is not readily available, or is inherently inaccurate or uncertain. Such inaccurate and uncertain data may provide a basis for inferences about partial orderings that are accurate and certain.
For example, looking at two objects in a room you may be incapable of estimating their distances accurately, yet easily able to see that one is further away than the other. Likewise, despite uncertainty about actual illumination values, or colours of different parts of the visual field there may be certain and reliable information about gradients: in a certain direction the intensity is definitely increasing.
If the rate of increase in different directions can also be compared it may be possible
to find with high confidence, though perhaps low precision, the directions of maximum
and minimum rate of increase, or how the direction of maximum intensity varies,
e.g. curving to the right in some places, to the left in others, and sometimes not
curving. For example it is not difficult to locate the upper and lower bounds, or the
left and right bounds of the letter "Q" below, without having precise measures.
Q
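For illustration only (my own sketch, making no claim about how brains do it), a few lines of Python show how ordering information alone can be extracted from a noisy row of intensity samples; qualitative_profile and the sample values are invented:

def qualitative_profile(samples, noise_band):
    # record only where intensity is clearly rising ('+'), clearly falling ('-'),
    # or not noticeably changing ('='), discarding the unreliable absolute values
    out = []
    for a, b in zip(samples, samples[1:]):
        if b - a > noise_band:
            out.append('+')
        elif a - b > noise_band:
            out.append('-')
        else:
            out.append('=')
    return ''.join(out)

# A noisy horizontal slice through a dark glyph on a light background:
row = [200, 203, 198, 60, 55, 58, 62, 199, 201, 197]
print(qualitative_profile(row, noise_band=10))   # '==-===+=='

In this toy case the left and right bounds of the glyph appear as the positions of the '-' and '+' transitions, located no more precisely, and no less reliably, than the data support.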
Presenting mathematical techniques to extract partial ordering information from digitised images is beyond the scope of this document, whose main point is to present examples of static monocular images in which humans see a great deal of 2-D and 3-D structure, structure which, as far as I know, cannot be found by any existing AI vision systems. I suspect the training methods currently used would not enable those systems to learn to perform the tasks here.
The partial occlusion/occultation relations between adjacent blocks, along with the assumption that the blocks are all rectangular and that there are no perspective tricks, generate a collection of "above/below", "further/nearer" and "left of/right of" relationships that cannot all be simultaneously satisfied.
If any non-corner block is removed from the picture and replaced by the background colour, the scene depicted is no longer impossible because a chain of implications is broken by removing some of the occlusion relations.
However, as the image is, we can see it as depicting a rich 3-D scene with many affordances, e.g. spaces where a finger or some other object could be inserted between blocks, or the possibility of extracting any of the blocks, rotating it by 90 degrees about any of its axes, then reinserting it; and also the possibility of swapping any two of the blocks while leaving the others unchanged. All those are local collections of affordances (possibilities for change) "embedded" in the scene. For more on perception of affordances and possibilities for change see this Discussion Paper: http://www.cs.bham.ac.uk/research/projects/cogaff/misc/changing-affordances.html Predicting Affordance Changes (2007).
In the Reutersvard picture the complete set of perceived structures and possibilities is inconsistent. I don't know whether any non-human animals have the ability to detect that. Training an ape, or a young child, to use wooden blocks to create a structure depicted in an image, and then presenting the Reutersvard triangle, might be an interesting experiment. Compare doing that with a future robot. The Reutersvard picture is discussed in more depth, in connection with the roles of visual functions in (some) mathematical discoveries, here: http://www.cs.bham.ac.uk/research/projects/cogaff/misc/impossible.html
More of Reutersvard's pictures can be seen here.
Here are pictures of an impossible triangle, which actually exists, using an idea originally suggested by Richard Gregory:
However, if all the spatial relationships are represented as relative distances (or heights) between neighbouring portions, with no actual distances assigned, then a complete information structure can be built even though what it represents is inconsistent. That may appear to be a flaw in the form of representation using partial orderings rather than absolute values. On the contrary, I claim that in most situations, where impossibilities are not depicted, the use of relative rather than absolute values enormously simplifies the information processing and produces a more useful result, because it smoothly survives a variety of physical changes.
I suspect a young child will not notice any inconsistency (as adults fail to do for more complex impossible objects). Noticing the impossibility requires development of additional meta-cognitive mechanisms able to inspect and check the results of visual processing: only then can the impossibility be recognized, as with a set of N statements, like an instance of the following pattern, any proper subset of which is consistent:
Object D2 is above D1
Object D3 is above D2
Object D4 is above D3
.......
Object Dn is above Dn-1
Object D1 is above Dn
The ability to detect that locally consistent collections of information are globally inconsistent is probably a later evolutionary development, though there are many unanswered questions about how information structures using only partial orderings, not absolute values, are used, for example in controlling actions. For some actions they would be inadequate, such as a cat jumping up from the ground to the top of a wall, and other ballistic actions requiring accurate measures at the start of the actions.
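The detection of such global inconsistency can be illustrated with a small sketch (mine, not a claim about brain mechanisms): each "above" assertion is locally acceptable, and the impossibility appears only as a cycle in the whole set of assertions. The function name has_cycle and the toy data are invented.

def has_cycle(above_pairs):
    # above_pairs: (upper, lower) assertions; a cycle means no consistent
    # assignment of heights exists, i.e. the depicted arrangement is impossible
    graph = {}
    for upper, lower in above_pairs:
        graph.setdefault(upper, set()).add(lower)
        graph.setdefault(lower, set())
    visited, on_path = set(), set()

    def visit(node):
        if node in on_path:
            return True               # returned to a node on the current chain
        if node in visited:
            return False
        on_path.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        on_path.discard(node)
        visited.add(node)
        return found

    return any(visit(node) for node in list(graph))

blocks = [("D2", "D1"), ("D3", "D2"), ("D4", "D3"), ("D1", "D4")]
print(has_cycle(blocks))        # True: the whole set cannot be satisfied
print(has_cycle(blocks[:-1]))   # False: remove one relation and it is consistent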
Contrast the "impossible staircase" video
http://www.youtube.com/watch?v=cKRuYDYFLJI.
(This seems to use cinematographic trickery.)
There may be some features of images that our visual systems use but we are incapable
of noticing.
In his work on interpretation of 2-D line drawings of 3-D polyhedral scenes, Clowes
borrowed ideas from linguistics, describing the 2-D domain as "syntactic" and the 3-D
domain as "semantic". The analogies between language understanding and visual
perception were being actively explored in the 1960s (e.g. Kaneff 1970). However,
in spoken and written language there are additional domains of structure, for example
concerned with acoustic/phonetic/phonological phenomena in the case of spoken
language and concerned with points, edges, strokes, blobs, and letters in the case of
written language. (Sign languages seem to be more complex than either.) It seemed to
me that human perception of 3-D scenes also typically involved far more than two
domains of structures, especially scenes with motion and causation.
As a result of my exposure in the early 1970s to work on natural language processing,
including speech understanding and attempts to get machines to read printed and
hand-written text, it was clear to me that AI visual systems could also (in this case
after much learning) make use of multiple domains with very different contents. To test
ideas about how such visual systems could work we chose a very simple word
recognition project including domains of dots, various dot groups, lines, pairs of
parallel lines, plates shaped like letters, and words. The hope was that we could
later extend the ideas to phrases and sentences, and then generalise to other visual
scenes. (Lack of funding cut the project short in the late 1970s, however.)
Popeye
These ideas were developed in the Popeye vision project (so-named because we used the
Edinburgh University AI language POP-2 http://en.wikipedia.org/wiki/POP-2).
The project demonstrated how perception of noisy images of overlapping capital
letters represented by 'plates' could use multiple ontologies processed in parallel
and interacting in a mixture of bottom up, top down and middle out processing, as
explained in Chapter 9 of The Computer Revolution in Philosophy (1978)
http://www.cs.bham.ac.uk/research/projects/cogaff/crp/crp#chap9
Different domains of structure (ontologies) in the POPEYE 'vision' system
Dots, groups of dots, lines, pairs of parallel lines, flat overlapping plates,
letters formed from straight strokes, words, ...
(These different ontologies refer to different levels of structure in the
environment, not increasingly abstract patterns in the mind of the perceiver, or
patterns in the low level sensory data. The sensory
data are treated as 'projections',
possibly noisy projections, from entities in the environment. Working out a
good way to interpret those projections is non-trivial. (See the challenge
here.) That sort of use of vision is essential for its biological role:
organisms need to find food, avoid being eaten by predators, and in some
cases look after their young: their information processing systems did not
evolve to give them experiences that they interpret in that way. They need
to actually change the world, not just experience changes.)
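To make the layered-ontology idea concrete, here is a deliberately tiny Python sketch. It is not the actual Popeye code (which was written in POP-2 and was far richer): partial letter evidence from lower domains is passed up to an invented word lexicon, and the surviving word hypotheses feed expectations back down to undecided letter positions, so a word can be recognised before all its letters are established, as described below. All names (LEXICON, letter_slots, word_hypotheses) and the data are invented.

LEXICON = {"EXIT", "EXAM", "TAXI"}        # invented word domain
ALPHABET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def letter_slots(evidence):
    # per-position letter hypotheses so far committed by the lower domains
    # (dots, bars, plates); '?' marks positions with no bottom-up commitment yet
    return [ALPHABET if e == "?" else {e} for e in evidence]

def word_hypotheses(evidence):
    # bottom-up: which lexicon words are compatible with current letter evidence?
    # top-down: what do the surviving words predict for the undecided positions?
    slots = letter_slots(evidence)
    survivors = [w for w in LEXICON
                 if len(w) == len(slots) and all(c in s for c, s in zip(w, slots))]
    predictions = [set(w[i] for w in survivors) for i in range(len(slots))]
    return survivors, predictions

words, predicted = word_hypotheses(["E", "?", "I", "?"])
print(words)       # ['EXIT']  -- the word recognised before all letters are found
print(predicted)   # [{'E'}, {'X'}, {'I'}, {'T'}] -- top-down letter expectations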
Note and image, Added 8 Dec 2013
The main architecture of the Popeye system was designed and implemented by David
Owen and Aaron Sloman, circa 1975-7. Additional contributions were provided by
Geoffrey Hinton, who added a neural net for recognising words from partially
recognised letter sequences, replacing a hand-coded classifier.
This is one of the more challenging test images, produced by overlapping the
letter-plates and adding positive and negative noise. In some cases the word depicted
could be correctly identified before all the letters had been found. Likewise a
letter could sometimes be correctly identified (partly using top-down information)
before all its contributing parts in the image had been found.
Note added 8 Dec 2013
The Popeye program was hand-coded, on the basis of human analysis of requirements for
solving the problem. It is not known whether any of the machine learning algorithms
since developed for machine vision would enable a machine to learn to use the
multi-layer ontology, the architecture, or the algorithms used in the Popeye design,
or some alternative, possibly more general, or more powerful, combination of ontology,
architecture, and algorithms. (Compare the work on "deep learning" by Hinton et al.
referred to above.)
Can these ideas be generalised?
Can we generalise these ideas about multiple domains to small fragments of rich
images of 3-D scenes with varying structures?
The visual system will need to be embedded in a complex multi-layered architecture
with different routes through the system from sensors to effectors for different
purposes all operating in parallel and in some cases with sensors and effectors
collaborating closely (as noted by James Gibson in
The Senses Considered as Perceptual Systems (1966)).
The system could be further elaborated in the CogAff architecture schema and its
varied instances
http://www.cs.bham.ac.uk/research/projects/cogaff/#overview
including both "multi-window" perception and "multi-window" action using concurrent
interacting streams, deeply integrated with deeper systems evolved at different
stages in evolutionary history.
Does this blur the vision/cognition distinction?
Some researchers object to characterising the discovery of all the above information
as part of the function of vision. I don't think there is any point arguing over
boundary disputes of that sort. Rather, the substantive issues concern what happens to
information that comes in through the eyes, how it is represented, and how it is combined
with or related to information from other senses (haptic information, proprioceptive
information, and vestibular information, i.e. information about accelerations of the
viewer detected by the semi-circular canals). Moreover, much of the information that could
be claimed to be non-visual is in registration with the visual field, which suggests
that special linkages between unarguably visual data-structures and mechanisms and
more central forms of representation were produced either by evolution or forms of
learning.
Do you see a difference between the eyes in the two faces?
Stare for a while at the eyes in one face, then switch to the eyes in the other
face -- alternating after a pause.
Do you see a difference in the "expression" in the eyes?
This suggests a visual system that has deep connections with more central mechanisms
required for hypothesising internal states (e.g. mental states, including emotions),
some of which are represented by internal information structures in the perceiver in
registration with visual information structures. Hence the eyes, or mouth, can be
smiling or sad.
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/knots
Knot challenges (photos and questions)
What does a visual system need to be able to do in order to enable someone to answer
the questions about knots in these pictures?
(Added 4 Jun 2014)
Summary Conjecture
It seems that current machine vision systems do not use a sufficiently rich and
varied collection of low level descriptors for image and scene fragments, and as
a result the features they do find do not suffice as a basis for the information
an intelligent perceiver needs.
I conjecture that biological evolution produced animal visual systems that look
for a far broader class of "sub-feature" image fragments (i.e. fragments that
are smaller than the visible features such as dots, edges, corners, texture
patches, ridges, grooves, bumps, etc.) along with a variety of compositional
relationships among those features that are used to grow larger, more useful,
information structures across several layers of organisation.
For this the visual mechanisms use a sort of visual grammar for indivisible
components along with modes of composition to produce larger information
structures. The result is that the system has resources to acquire, store, and use
a far wider variety of low-level image, scene, and process fragments than
current AI systems are capable of producing.
Moreover the fragments are used in biological systems not to produce models or
descriptions from which the original images can be reconstructed, or which can
be used to derive images presenting views of the scene from changing viewpoints
(as some AI 3-D vision systems can do). Animals don't (in most cases) need to
generate movies of what they see.
Rather, they use perception to perform a large collection of more subtle and
more varied functions, including: describing causal connections, describing
possible changes in the scene and what the consequences of those changes would
be, describing constraints on changes and possible ways of altering the
constraints, and many more aspects of the scene that are relevant to what can or
cannot be done, or what can or cannot happen. In some cases, the scene fragments
can be used in 'online' visual control of actions, i.e. visual servoing, for
instance constantly deciding how to vary the trajectory of a hand moving to
grasp a twig. This use of 'affordances' in the environment was emphasised by
James Gibson, though I think he noticed only a small subclass of a wide class of
phenomena involving many types of affordance, including vicarious affordances,
epistemic affordances, and deliberative affordances, as discussed in this
presentation on the functions of vision:
http://www.cs.bham.ac.uk/research/projects/cogaff/talks/#gibson
For static images and scenes the very low level ("sub-feature") fragmentary
descriptions would merely summarise what exists in various places, whereas for
moving images and scenes the mechanisms would produce temporally ordered
fragments and then derive descriptions of process-fragments and go on to explore
ways of combining the fragments into information about changing scenes,
including affordances, causal interactions, looming threats, new possibilities, etc.
What are the sub-feature information fragments?
Finding out what forms the information fragments take is a substantial research
challenge. Many researchers would simply assume that they are either image pixel
contents representable by numbers representing scalar values, or else
combinations of numerical values represented as vectors or arrays. What else
could they be? Certainly not small geometrical shapes, since those are
themselves composed of smaller features, such as edges, corners, interior,
exterior, line-width, area, etc. What else could they be?
One possibility may be networks of partial orderings. For example, instead of
recording intensity values at particular locations (which are likely to be
noisy and inaccurate), record directions in which the values are increasing or
decreasing most strongly and directions in which there's no noticeable change
(contour features). Other features might be whether curvature is increasing or
decreasing or fixed in a particular direction. Another information fragment
might record whether some feature is open or closed, like a small closed contour,
or whether the closed contour is symmetrical or squashed in a certain
approximate direction, or whether one sort of feature is inside or outside a
curve feature.
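As a purely speculative illustration of what a "network of partial orderings" might look like as a data structure (every relation name below is my invention), a fragment could be recorded as a small set of qualitative relations, and two fragments treated as possibly corresponding when neither contradicts the other:

fragment_a = {
    ("intensity", "rises-towards", "upper-right"),
    ("contour", "is", "closed"),
    ("curvature", "increases", "clockwise"),
    ("bright-patch", "inside", "contour"),
}

fragment_b = {
    ("intensity", "rises-towards", "upper-right"),
    ("contour", "is", "closed"),
    ("curvature", "increases", "clockwise"),
}

def compatible(f1, f2):
    # treat two fragments as possibly the same scene feature if the sparser
    # record is contained in the richer one (a crude stand-in for real matching)
    small, large = sorted((f1, f2), key=len)
    return small <= large

print(compatible(fragment_a, fragment_b))   # True: b could be a sparser view of a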
A revolutionary proposal: describe kinds of matter?
Is it possible that in addition to recording just image fragments, scene
fragments, and fragmentary image and scene processes, animal visual systems also
encode information about the kind of matter of which the visible objects are
composed? Some evidence for that could come from static visible features, e.g.
indicating wood, iron, string, hair, water, etc. But others might come from the
dynamic behaviour over time, indicating more or less viscous fluids, more or
less flexible strings, etc. (See the PhD thesis of Veronica E.
Arriola-Rios, to be available soon; draft version Oct 2013.)
How would such features be recorded in brains? We could look for patterns in the
firing of neurons to see what they can encode. Or we might look for information
encoded in molecular, sub-synaptic structures: a potentially much richer
"descriptive language".
Of course, that raises deep and difficult questions about how the molecular
structures are set up, how they are communicated, how they are used, how long
they endure, etc.
I have presented, above, some examples of images that at present AI vision systems
can't process in the sort of way that humans do (as far as I know). There are
probably millions more examples available on the internet. A challenge is to
devise forms of processing of the images that achieve far more than current AI
techniques can, in providing information about image and scene contents that
might be useful for a robot (as opposed to being shown on a display to impress
humans wanting to know about machine vision).
Don't just assume that visual input is a rectangular array of numbers, or RGB
values. That's what your computer may receive as input, but that's just an
accident of the technology developed so far. Brains get time-varying collections
of photons from the optic array (Gibson), which are sampled both in very high
resolution in a very small area, and in much lower resolution, in a surrounding
area. Moreover the sampling centre is constantly being relocated, and as a
consequence also the surround. The relocation may be due to spontaneous saccades
or to tracking a moving object. Moreover there isn't just a single form of
processing, or a pipeline of processing. Instead the sampled information --
high and low resolution -- is constantly being transmitted to different brain
regions that may be looking for different information in parallel. What they look for, and how they process what they find, may partly
be determined by some fixed brain mechanisms, partly altered by variable
thresholds and filters, and partly changed because different mechanisms for
actively deriving and combining information get turned on or off, or combine
their outputs in different ways. For example, different adjacent information samples may be used to derive (i.e.
not merely combined to form) a new description of what's in a particular place,
or may be added to locations in histograms building summaries of how much
information of various sorts has recently been coming in, without recording
exactly where.
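A small sketch of my own (invented rule and data) contrasts the two uses just mentioned: the same samples either yield a derived local description, or are pooled into a location-free summary of how much information of various sorts has recently arrived.

from collections import Counter

def derive_local_description(samples):
    # (a) derive something new from adjacent samples: here, crudely, whether
    # the local patch looks step-like or roughly uniform (invented rule)
    return "step-like" if max(samples) - min(samples) > 50 else "roughly-uniform"

summary = Counter()   # (b) a location-free record of what has been arriving lately
for patch in ([10, 12, 90, 95], [200, 198, 202, 201], [5, 7, 6, 80]):
    summary[derive_local_description(patch)] += 1

print(summary)        # Counter({'step-like': 2, 'roughly-uniform': 1})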
This ability seems to depend on the existence of meta-cognitive mechanisms supporting
the ability to attend to intermediate fragments of information in a multi-layered
visual system while it processes visual information at various levels of abstraction,
using different ontologies. This is partly like the ability of a skilled linguist to
attend to aspects of heard dialects or individual differences in pronunciation that
are not noticed by ordinary speakers of the language.
It is possible that many animals share similar visual capabilities with humans while
lacking the architectural basis for inward directed attention of the sort referred to
here. Investigating such species differences is not easy, with species that cannot
talk about what they experience, but ingenious experimenters may come up with
something. It's also possible that such capabilities are not present during early
stages of visual development in humans, and only start to grow themselves after the
non-introspective visual mechanisms have reached a certain stage of development.
I suspect that an adequate account of human (and some animal) visual information
processing mechanisms and architectures will turn out to be closely related to
mathematical capabilities. Of course, many of the evolved mechanisms are available
also to blind mathematicians, whose blindness is due to peripheral abnormalities,
and that may provide important clues about the functions of normal vision.
Why mathematics? The answer is quite complex.
J.J.Gibson drew attention to the fact that a major function of perception in animals
is to provide information about what is and is not possible for them. This was a very
different high level view from the more common view (e.g. in David Marr's 1982 book
on vision) that the function of vision is (roughly) to compute the reverse of the
projection process that starts with light leaving surfaces and ends with detection of
image fragments by retinal mechanisms. Marr's theory emphasises acquiring information
about reflective surfaces in the environment, e.g. their distance from the viewer,
curvature, orientation, colour, reflectance, and how they happen to be illuminated.
Gibson, by contrast, emphasises the importance of acquiring information about what
can and cannot be done by the perceiver in order to achieve goals and avoid unwanted
states, e.g. colliding with something, getting stuck in a small gap, failing to grasp
something. An obvious third view is a synthesis of the two views which generalises
both Marr's and Gibson's theories. However, I have been arguing for several decades
that even the combination of their views on the functions of vision is not general
enough.
Part of the reason is that the environment is generally neither static, nor in a
fixed relationship to the viewer, since sighted animals use their vision when moving
(one of Gibson's important observations). Moreover, perception of processes can provide
far more information than the sum of what Marr and Gibson mentioned. For example, if
you see a person holding a part of something while moving it, the behaviour of the
thing held can provide much information about the kind of material, e.g. whether it
is rigid or flexible, and if flexible, whether it is stiff or not, elastic or not,
stretchable or not, and even how light the material is. For example, consider a
person rigidly grasping something long and thin in a hand that moves up and down. The
behaviour of cotton, string, rope, rubber, wire, and wood will be different, and
there are different sub-cases for each of those categories. Likewise the process of
folding something will produce visibly different motions if the material is tissue
paper, printer paper, wrapping paper, kitchen foil, clingfilm, towelling, etc. The
work of G. Johansson ('Visual perception of biological motion and a model for its
analysis', Perception and Psychophysics, 14, 1973, pp. 201--211) showed that even seeing
only the motion of lights attached to the joints of moving people can provide rich
information about what they are doing.
Unfortunately, Gibson focused only on information about what is immediately possible
and relevant to the organism's needs, whereas it is clear that humans (including
pre-verbal children, to some extent) and some other species can acquire information
not only about what is the case in a scene, or what actions are possible for them,
but also what other processes can occur under various conditions, whereas in other
conditions the processes will be rendered impossible -- e.g. walking through a
doorway after the door has been shut.
I claim, as argued here, that the ability to reason about lines, angles, circles, polygons,
etc., and their properties as found in Euclidean geometry (and illustrated in other
documents in this web site) is (a) closely related to the ability to see possibilities
and impossibilities in addition to what exists, (b) important for human and some
non-human animal perception and reasoning, and (c) dependent on visual mechanisms and forms
of representation of spatial information not yet found in AI/Robotic theories, or in
psychology or neuroscience.
Unfortunately there is reason to suspect that something other than current forms of
computation may be needed to support those biological geometrical capabilities,
though I am not sure. What is required is not mechanisms to simulate and predict
consequences of spatial structures and processes, but mechanisms to detect sets of possibilities
and constraints. This seems to require a new collection of meta-cognitive
capabilities, some of them shared with other species, and pre-verbal children, as
discussed in
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/toddler-theorems.html
In contrast, I am not claiming that in dealing with my examples the vision system
always makes use of structures previously learnt, to label detected structures.
Rather the visual system creates 'on the fly' new concepts, or new descriptions,
related to a task, and sometimes uses them to find and use information in other parts
of the same scene or scenes encountered soon after.
So when the system says "the object visible in this portion of this image (e.g.
bounded by a particular curve in the image) is the same as the object in that portion
of that image" that need not make use of recognition of the object as fitting a
known specification. Rather a temporary, task relevant, novel specification is
created to fit the object and then used to re-identify the object in another image.
Then the specification may be discarded because the object will never be seen again.
This is a bit like using results of monocular processing to do stereo matching, which
can't be done with Julesz figures but can be done with more natural stereo pairs.
A piece of science fiction
So here's a piece of science fiction: when asked to find structures in images the
visual system looks for possibly interesting fragments of an image and assembles new
chemical formulae representing them. Then it uses those molecular models to guide
searches for things they can match in other places in the same image and in new
images.
E.g. in this fiction a fragment description might roughly translate into an
English phrase:
An elliptical shape with the appearance of a slightly squashed tube with a shiny yellow surface curved round almost forming a toroid, with an opening on the left, immersed in liquid reflecting a bright light above and to the right.
The next description might have the toroid's opening on top, suggesting that either the object has rotated or the viewpoint has changed.
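Purely as an illustration (field names and values are my inventions, not a proposed visual description language), the English gloss above might correspond to a temporary structural description of roughly this form, built and then discarded 'on the fly':

fragment_description = {
    "shape": {
        "kind": "tube",
        "cross-section": "slightly-squashed",
        "global-form": "almost-toroidal",
        "opening": {"present": True, "side": "left"},
    },
    "surface": {"colour": "yellow", "finish": "shiny"},
    "context": {
        "immersed-in": "liquid",
        "illumination": {"kind": "bright-reflection", "from": "above-right"},
    },
}

# A second description differing only in where the opening appears could then
# suggest either rotation of the object or a change of viewpoint.
second = {**fragment_description,
          "shape": {**fragment_description["shape"],
                    "opening": {"present": True, "side": "top"}}}
differences = [k for k in fragment_description["shape"]
               if fragment_description["shape"][k] != second["shape"][k]]
print(differences)    # ['opening']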
This is science fiction because evolution is unlikely to have produced very low level
descriptors that correspond to terms in a recently developed human language. So we
may have to find new low level feature labels and a 'grammar' for combining them to
form structural descriptions in a primitive visual description language shared
perhaps with other species, since it is clear that many others have powerful visual
capabilities, e.g. hunting mammals, nest-building birds, elephants, apes, squirrels
and many others.
This description should be done very rapidly, e.g. during fixations between saccades,
i.e. without any training involved (though it may use previous training, or even just
previous evolution). This requires the ability to put new temporary descriptions into
a short term (working) memory, rather than storing learnt descriptions in a long term
trained memory.
I can begin to see how rapidly assembled chemical formulae could produce a rich
enough language for this purpose, but I have no idea what mechanisms could go from
retinal stimulation to molecular construction. However, I don't see how neural
mechanisms can do it either.
But perhaps this is all science fiction.
Note that my hypothesised requirement for rapid construction of novel structural
descriptions is also a requirement for comprehension 'on the fly' of spoken or written
language - e.g. reading a sentence like this one for the first time, understanding
it, then perhaps throwing it away as uninteresting...
Things still to be done include (in no particular order).
Some experiments on this are described in:
http://www.cs.bham.ac.uk/research/projects/cogaff/misc/multipic-challenge.pdf
But the same innately specified designs for learning mechanisms can be
instantiated in very different ways during individual development (epigenesis) as
sketched in Chappell and Sloman (2007). A result may be that the same perceived scene
is seen by different viewers in very different ways, as made up of different
components interacting concurrently.
A piano-teacher watching a young pianist may need to think about posture, how
wrists are held, various aspects of performance including phrasing, speed,
variations in speed, how detached or smooth notes are (staccato, legato),
appreciation of the global structure of the work, making concurrent themes clear
in a polyphonic episode, switches of mood, and many more. All of this is
required in order for the teacher to help the pupil develop in various
dimensions.
Must probabilistic reasoning about statistical relationships between inputs and
outputs, or between various aspects of inputs at different levels of
abstraction, play a significant role in the teacher's mental processes?
Compare Minsky's comments on Beethoven in this extraordinary BBC Radio 3 interview
http://web.media.mit.edu/~minsky/BBC3.mp3
By contrast, we attempt to immediately assign three-dimensional interpretations to intensity edges to initialize processing at the level of intrinsic images, and we maintain the relationship between intensities and interpretations as tightly as possible. In our view, perfecting the intrinsic images should be the objective of early visual processing; edges at the level of the primal sketch are the consequence of achieving a consistent three-dimensional interpretation.
Acknowledgments
I have discussed these topics over many years with various colleagues in
Birmingham doing research in vision, AI, and robotics, including especially
Jeremy Wyatt.
Since 2011, I have also been discussing these issues with
Visvanathan Ramesh, http://fias.uni-frankfurt.de/neuro/ramesh/
A recent PhD student, Veronica E. Arriola-Rios, has done pioneering preliminary work
on learning to perceive objects deforming in response to forces. (Thesis shortly to
be made available.)
In the distant past I discussed vision at Sussex University, with Max Clowes,
Robin Stanton, Alan Mackworth, Christopher Longuet-Higgins, David Owen, Frank
O'Gorman, Geoffrey Hinton, Larry Paul, Steve Draper, David Hogg, Geoff Sullivan,
and others. Some of us built the Popeye vision system described in chapter 9 of
The Computer Revolution in
Philosophy. (1978).