WHY VISUAL SYSTEMS PROCESS SKETCHES Aaron Sloman and David Oven [*1] Cognitive Studies Programme, School of Social Sciences, University of Sussex, Brighton, BN1 9W, England Abstract Why do people interpret sketches, cartoons, etc. so easily? A theory is outlined which accounts for the relation between ordinary visual perception and picture interpretation. Animals and versatile robots need fast, generally reliable and "gracefully degrading" visual systems. This can be achieved by a highly-parallel organisation, in which different domains of structure are processed concurrently, and decisions made on the basis of incomplete analysis. Attendant risks are diminished in a "cognitively friendly world" (CFW). Since high Levels of such a system process inherently impoverished and abstract representations, it is ideally suited to the interpretation of pictures. 1. Is the study of impoverished pictures relevant to 'real' vision? AI vision work concerned with pictures, including digitised photographs. straight-line drawings and cartoons, etc. has recently been criticised as irrelevant to visual perception of objects in the environment, Clocksin [1978]. Related themes can be found in Horn [1978]. It can be argued that studying impoverished pictures with great Local ambiguity leads to overemphasis on top- down, knowledge-guided visual processes! as in Shirai [19751 and Minsky [19751, and on complex control structures, as in POPEYE [*2]. Lack of detail in artificial images causes difficulties of interpretation which, it may appear, do not arise in ordinary perception, where disambiguating detail is provided by colour, stereopsis, optical flow, etc. Admittedly images interpreted by most A.I. programs Lack many features available even in monocular perception of static scenes, from which useful information can be extracted with powerful algorithms and computational resources. [Marr 1976, Horn 1978 and papers cited therein]. Horn's claim: 'we may have closed our eyes to the raw image for too Long', is reasonable, and supported by his own excellent work on images. But we mustn't now close our eyes to all else. Extraction of Low level image and scene features is but a sub-process of the visual mechanism. That a powerful subsystem is normally used does not imply that it is essential for vision. Stereopsis certainly occurs, and needs to be . explained, but our ability to perform everyday tasks with just one eye also needs explanation. Similarly, we can often recognise things when detail is missing, or spurious information added, through poor visibility, eye defects, strong back-lighting, restricted view angle, or intervening shrubbery. Normally, we use perceived detail to segment the scene into objects, but sometimes the grouping must go beyond consideration of image continuities and discontinuities because of occultation of some objects by others, camouflage, shadows or spurious juxtapositions. ALL this suggests considerable modularity: various sub-systems produce information, perhaps partly duplicated by other sub-systems, and Less precise information may suffice if the ideal is not available. This modularity could allow a component which ideally should be driven by the data, to be driven instead by prior knowledge activated by other data. This might explain both our facility with impoverished pictures and the occurrence of misperceptions even in excellent conditions (well-known examples are the hollow mask [Gregory 1970], and the triangle containing "PARIS IN THE THE SPRING"), How can we function so well when so much detail is lost? Recognition of sketchy drawings can be rapid and effortless [Hochberg 1978, page 193, citing Ryan and Schwartz]. Perhaps, when we look at pictures, intermediate results of the interpretation process are similar to some intermediate (sketchy) results of the processes of normal perception? Perhaps normal perception uses mechanisms with built-in characteristics designed to cope with abnormal, specially difficult, situations? Our central idea is that visual systems process many different domains of structure in parallel. So analysis of relatively "high-level", abstract, incomplete, representations, sometimes occurs & parallel with detailed analyses of visual data. [*3] Higher Level processes would then be driven in part by prior knowledge of specific sorts of objects (e.g. generalised cylinders, humanoid figures), lowers mainly by very general (implicit) knowledge about 3-D surfaces, Lighting, motion, etc. Occasionally such high- level processes would reach conclusions which are overturned by more detailed analysis, e.g. the "double take". However the different processes would normally produce compatible results, making possible the modularity referred to above. How? A basic assumption is that the visual system has evolved to work in a "Cognitively Friendly World", a CFW, (which may be very unfriendly in other respects). Here are examples of cognitive friendliness: (A) The optic array is rich in useful information about the environment --as noted above. This is due in part to the sorts of surfaces objects have, in part to a plentiful supply of short wave-length radiation and a transparent atmosphere. (N.B. the Last two conditions are very variable.) (B) The space of physically possible objects and processes is sparsely instantiated in the actual world (unlike science fiction), i.e. there is Limited independent variation of features and relations: this makes images redundant. This is illustrated by planarity, continuity, rigidity, etc. (Marr [1979]) and the fart that no animal has the ear of a zebra and the body of a giraffe. (C) Confusing coincidences (e.g accidental alignments and juxtapositions) are rare. This depends both on the kind of environment and on the Low probability of such viewpoints for any given scene. To make use of (A) a visual system needs good detectors for features of the optic array. Since these depend on Laws of physics they don't vary much from one part of the world to another and tan be usefully compiled into hardware. If we have evolved mechanisms to take advantage of (A), might we not also have evolved mechanisms to take advantage of (B) and (C)? Using (6) requires using knowledge of what sorts of objects actually occur, e.g. knowing about cylinders, about rigidity, and about zebras and their ears. Some of this (e.g. many objects are Locally rigid) is useful in nearly all environments, and might be built into genetically determined mechanisms, whilst some (e.g. what sorts of plants, animals, or buildings, are common) will vary considerably and must be Left to individual learning. Making use of (C) involves having good process organisation, to find the 'best' percepts [Hinton 1977]. A consequence of (B) and (C) is that usually any good interpretation of a visual image will be unique, and therefore the best one. (B) and (C) could also justify higher level processes jumping to knowledge-guided conclusions on the basis of partial results from Lower levels. This could enable good decisions to be made in poor viewing conditions, and in good conditions would enable decisions to be made faster. (All of this is demonstrated in a very simple world, by the Popeye program [*2].) So, assumption (A) is of use in good viewing conditions where objects are unfamiliar, whilst 1st (B) and (C) are of use where conditions are bad but objects are familiar. A system designed with this flexibility might acquire a speed advantage where all of (A) (B) and (C) are satisfied, if different sub-systems work in parallel. It would still have to be basically data-driven (bottom-up) if serious mistakes are to be avoided, but it need not be pass-oriented, with each layer waiting for lower levels to "complete" their analysis (if completion has any meaning). If higher Level systems can operate thus on impoverished data available in adverse conditions, and on incomplete, partial, results of lower levels in good conditions, they should also be able to interpret some highly impoverished artificial data, such as we find in pictures. If so, the interpretation of pictures is not merely a culturally specific, learned, process. If ordinary perception of objects and relationships requires learning, then interpretation of pictures of the same objects will not normally require additional learning, on this view: toddlers we have observed respond naturally to cartoon drawings of familiar situations, without anything Like the struggle which characterises learning to read. [Cf. Hochberg and Brooks [1962] This is not the theory criticised by Gombrich [1960] and Goodman [1969] that realistic paintings and drawings produce the same visual stimulus as the things depicted. 2. Unarticulated, semi-articulated, articulated representations. We have claimed that vision requires far more than efficient detection of features of the optic array, and that several different domains of structure are processed. To explain why, we must ask: what is vision needed for? An animal, or robot, uses perception to make decisions in pursuit of its goals and to tell whether they have been achieved. It also needs to detect unexpected dangers and opportunities. All this requires construction of representations which articulate the environment into objects with properties and relationships of varying sizes and degrees of abstractness. Rarely will the detection of a particular feature in a particular Location on the retina be very significant. Similarly, huge data-bases of unarticulated information, like depth-maps, surface colour or texture maps, surface orientation maps, primal sketches [Marr 19761, can be of Little use without considerable further processing. They are effectively new, enhanced, images, even though they may contain 3-D information. Though important for further processing, these unarticulated databases are not directly useful for decision and action: only generalisations related to global image statistics can be learned or invoked e.g. 'lots of green', but not 'plum on tree', might be recognised. To some extent groupings of fragments of information into larger wholes can be achieved by parallel "local" computations, e.g. relaxation techniques linking items subject to constraints [Hinton 1977, Radig 1978, Frisby and Mayhew this conference]. If the links exist without explicit description of the properties and relations of the linked groups, the database is semi-articulated. The process of growing such links may enable some useful global statistics to be collected, but represents objects only implicitly. Though providing a useful intermediate stage, a semi-articulated database does not explicitly represent one object as above, inside, between, or able to fit into, others. Such information is then not available for deciding, planning and learning. (Compare Marr's 'principle of explicit naming' [Marr 19761. The same point was made in Minsky [1961].) Further study of visual articulated representations requires analysis of types of actions performed by different animals. (Some birds can Learn to use a foot to depress one end of a lever, exposing food behind the other end. This probably involves articulating the lever into parts, e.g. ends, capable of different though causally linked motions.) It seems unlikely that a small number of mathematically simple structures (e.g. generalised cylinders) with a small number of mathematically simple relationships (e.g. equations linking co- ordinates) will suffice for human perception. Besides crumpled newspapers (despaired of by Marr [19791) we see fields and forests. Similarly, cluttered scenes made even of "clean" cylinders will have messy structure at Larger scales, like Large sets of axioms in a theorem-prover's database. To discern significant objects and relations in Large and messy collections of image and scene features we need a much richer descriptive vocabulary than A1 vision programs have hitherto incorporated. This is why multiple domains are important. 3. Multiple domains Clowes [1970,1971] and Stanton [1970] stressed that visual perception and picture interpretation do not simply involve description of image structures. They described "mapping rules" linking different non-isomorphic domains. A domain is a class of structures defined by a "grammar" or set of axioms (e.g. 3-D Euclidean geometry). Scenes have quite different "grammars" from images. This needs to be generalised (as in Hearsay and Popeye) to allow many domains, with different though possibly overlapping grammars. Very briefly, this is because using many different domains allows: (a) 'structure sharing' between processes of recognising different sorts of objects, (b) intermediate results of processing to be relatively secure even if back-tracking is required at higher levels, (c) higher Levels to recognise important scene features before lower Level processing is complete (see below) (d) high level recognition despite poor Low-Level detail, (e) data derived from an image to be usefully structured (compare 'Scripts' and 'Frames'), (f) goal-directed activation or de-activation of large chunks of knowledge (e.g. 'mental set'), and (g) communication between different sensory modalities. A.I. vision work has so far focused on a small number of mathematically tractable special cases. A good survey of the different domains of structures useful in visual perception is still Lacking. Likely relevant domains include 2-D arrays of changing colour and intensity, 2-D configurations of lines and regions and of texture, domains involving patterns of motion in both 2-D and 3-D, overlapping 2-D silhouette shapes [Paul 19761, curved and flat 3-D surfaces, both 2-D and 3-D stick figures [Palmer, 19751, various domains involving forces and a variety of cause-effect relations, intentional actions etc., properties Like flexibility, rigidity, elasticity, hardness, etc. Besides plane surfaces, edges, vertices and generalised cylinders for representing shapes, we probably need generalised spheres, hemispheres, bags, tubes, strings, etc. In addition we need models for significant parts of such objects and their surfaces, like: hollows, grooves, holes, Lumps, ridges, openings, rims, etc., and models for relating one to another (the groove runs across the hollow). Features and relations invariant under non-rigid transformations XX ions are particularly important in our world. We also need a large collection of schemas for types of motion and action: moving towards, moving away from, moving into, flattening, twisting, folding. Compare Hayes on 'naive physics' [19791. Studies of pictures and cartoon movies can yield useful insights into the structures deployed in perception [Draper 19601. Of course, it is hard to specify how such models may be represented, invoked, etc. in a working, system. Is all this "cognition" relevant to vision? A major feature of visual learning is linking new domains into the visual system -e.g. Learning to see the muscular structure of human bodies, for artistic or medical purposes, learning to see when it is safe to cross the road. There is no sharp boundary between practically useful vision and cognition. 4. The domain of images The structure of the 2-D image domain is important for both picture interpretation and normal vision: why? Goodman [1969 p.381, rejecting the idea that pictures and objects produce similar visual input, accounts for the "realism" of some pictures in terms of familiarity. But this fails to explain why even a two year old child can learn some pictorial styles, whilst others, though mathematically equally adequate, seem much harder. The human visual system does not work with arbitrary combinations of image elements, but, as the Gestalt psychologists noted, is Largely constrained to use continuity, proximity, smoothness, concurrency, symmetry, containment, and other geometric and topological relationships, for linking Low-Level features into cues which invoke more abstract or global representations, which may themselves be similarly treated. A grasp of such relationships is required for interpreting pictures also. However, much richer image description languages are required than existing A1 programs can handle: many can only describe the topology, and a few metrical properties, of networks of straight lines or picture regions. Others provide a simple semi-articulated description with no grasp of the implied structure [e.g. Radig 1978]. Further, articulated 3-D interpretations, required for planning actions, can be linked to image structures to facilitate ~processing. For instance, to answer the question "what is Y going to hit?", "will I pass near A if I go straight towards B?" one can "traverse" the relevant part of the image to find the relevant bit of the 3-D interpretation. Moreover, our theory implies that in visual perception and in picture interpretation, descriptions of parts of a complex 3-D scene are built up in parallel. The linking of incomplete descriptions of different parts of the scene to form larger structures, will be facilitated if the 3-D structures are closely related to the network of descriptions of 2-D image structure -the latter providing indexing or addressing routes. [*4]. This applies to both real vision and interpretation of pictures. (More on this below. So, against Goodman we claim that "familiarity" of pictorial representations is not a matter of frequency, but depends in part on the way 2-D relationships are used in normal vision. Of course, mere similarity of domains does not suffice to explain facility with pictures. Maps also make use of 2-D structures and relationships, yet learning to use a map to find one's way around is harder than interpreting pictures. This is partly because our stored knowledge of objects is addressable by means of the kinds of articulated representations produced by both retinal images and artificial pictures, whereas our 'cognitive maps' of familiar surroundings are not normally addressable by the kinds of structures created when we look at maps. Things might be different if we could fly! 5. Reasons for using impoverished articulated representations There are additional reasons why impoverished picture structures might be related to normal vision. We have already given a general reason why a visual system needs to be able to cope with impoverished representations: articulation of the scene implies reduction of information. Other reasons concern processing, the purposes of vision and the environment: 5.1. Some details may interfere. Much of the detail available to the eye arises from variable conditions, including Lighting, atmosphere, viewpoint, non-rigid motion, and changing relations. The use of abstract schemas implies less memory space, faster matching, smaller searches among stored specifications and enables recognition of individuals or types (abstracting from individual details) in novel circumstances. It also provides the basis for forming generalisations. 5.2. Some details aren't needed in a "cognitively friendly" world. It may be possible to distinguish objects on the basis of only a few features. E.g. a colleague once remarked that he could recognize a zebra with just its ear visible. In a CFW where the space of possible structures is known to be sparsely instantiated ((B) above), inferences can be made from fragmentary evidence. 5.3. Details may be missing or spurious. As already noted, poor visibility, natural or artificial camouflage, eye defects, rapid motion, or the presence of visual obstacles, can produce degraded images. Injury can remove stereopsis. Optical flow is not always available. Stereopsis and optical flow don't help with distant stationary scenes. Extracting global features (e.g. silhouette descriptions) from such degraded data sometimes enables recognition of useful cues to overcome the difficulties. Once again, this depends on friendliness: e.g. important objects having distinctive outlines from most views. This requires assumptions (A) and (B). 5.4. Shared structure in memory entries. The system may share recognition processes between different objects by using a discrimination net. As partial specifications are built XX lt up, the set of remaining possibilities narrows. [Birch 1978 describes such an extension to Popeye.3 Different recognition processes thus share significant sub-processes, minimising back-tracking or breadth-first searching. This uses incomplete descriptions, i.e. intermediate nodes in the discrimination net. 5.5. The need for speed. Even in a CFW, unfriendly circumstances may demand rapid decisions. The next section discusses the relevance of incomplete data. 6. Speed and the processing of incomplete representations Complex articulated representations cannot be created instantaneously. Fast parallel processing at low Levels depends on each processor being concerned with a relatively small well-defined portion of the data, and being able to work independently or co-operate with a relatively small set of neighbours. Thus, even data-flow channels can be 'hard-wired'. (Such mechanisms permit certain non-local interactions, via information propagated through the net.) But locality and independence do not characterise the process of articulating a mass of data into objects whose contributory regions change from one image to another. Portions of images relevant to a triangle or tiger vary in size and shape, and may be split into separate regions by intervening objects. Hence data-flow cannot be pre-determined, and organising data from particular images will therefore take a significant amount of time, compared with localised parallel computations. Though detectors for all possible edges may be 'hard wired' in advance, detectors for all possible triangle or tiger shapes could not be similarly pre-determined, partly because of the explosion of connections, partly because not all environments include them. The task of segmenting, aggregating, recognising, and building useful scene descriptions is therefore inherently much slower than low-level tasks. Thus there are Limits to the speed-up available from hard-wired parallelism, and other mechanisms to speed things up could be useful: milliseconds may matter when life, or food, is at stake. Cues invoking previously computed information can speed things up. This old idea Ce.g. Roberts 19651 is now associated with the 'frames' theory [Minsky 19751. Compare the idea of a 'phrasal lexicon' [Becker 19751. But the theory leaves many questions unanswered: on encountering a new scene where should one start looking for cues in the image? At which level of analysis (in which domain) will the most useful cues be found? How can cues be recognised rapidly? The last question is very difficult, and will not be answered here. Our answer to the first two is that as far as possible analysis should proceed simultaneously in many Locations and at many Levels, since the Location or domain of the most useful cues cannot be predicted. This should be concurrent with general purpose image processing. Analysis of higher-level domains cannot begin until after some flow of data from Lower-levels, but it need not wait for completion. The structure of such a network of processes will vary from image to image, so time and resources may be saved if its growth can be constrained, eliminating or suppressing portions which are not required, and giving priority to those yielding useful results -e.g. activating and deactivating whole domains. This can be achieved if high-level structures (where the networks are relatively small) can be recognised whilst Lower Level networks are still incomplete. Thus construction of the network of communicating sub-processes which interpret the image, may itself be controlled by partial interpretations. If, at any level, there is a lot of partially processed information, things may be speeded up by treating the partial results as a new image, in which gross features provide useful higher-level cues: using redundancy in a CFW [Sloman 1978, ch 91. A specific purpose (e.g. finding a tool) might be achieved using this gross structure, without waiting for details [*5]. So, in some CFW environments, allowing many domains of structure to be analysed in parallel, could speed up actions. Even marginal advantages may influence biological evolution when resources are scarce, or predators plentiful. There is a kind of recursion in our argument, and possibly also in biological evolution. Where speed is important, the pressure towards further decomposition into parallel sub-systems is great, provided images have sufficient redundancy, i.e. provided it is a CFW. We have not claimed that higher Level processes can influence lower levels, except perhaps by aborting, or re-directing them. But it may be useful for partial results to affect some thresholds or even the invocation of specific forms of analysis, at Low levels. Alternatively, cognitive processes may simply control the direction of attention, without modifying the nature of the processing. Even if animal physiology permits no direct downward influence on the processes which generate, say, a primal sketch, there might still be good reasons for designing artefacts differently. It would be no different in principle from making high levels influence direction of gaze, dilation of pupils, convergence of two eyes, etc. all of which affect the low-level image. 7. Some implications In a CFW, multi-layered processing can improve flexibility, graceful degradation and speed. This applies to any kind of activity requiring intelligent analysis and interpretation of a large amount of data, based on expertise in the field, e.g. solving a complex mathematical problem, debugging a program, etc. One consequence is that demands on sub-systems are relaxed. For "instance, if processing of level P has to be completed before processing at level Q can be begun, then it is important that P terminate. However, if Q can get started early, then it does not matter if P refines its analysis indefinitely! In vision, input is continuous, so Lower levels cannot "finish" their analysis. Thus higher levels must in any case operate in parallel with them. Moreover, in a CFW, mistakes at lower Levels can be tolerated without disaster. The system must be conservative about transmitting items to higher-level domains, i.e. only sending well-supported reports. Then occasional mistaken reports will not combine usefully with other reports received at that level: (compare the role of 'impossible fragments' in Birch 1978). If a relatively large object is recognised on the basis of several different fragments reaching a high Level, then the chances of it being a mistake will be small, assuming limited independent variation of object features. So the system need not guarantee finding the best interpretation of any image, as in Woods [1977], since any good one will normally be unique, as we noted above. So it will often pay to accept a high level decision, abandon lower Level analysis, and re-direct attention to the next task [*5]. ALL this depends on knowledge enabling fragmentary evidence to invoke specific Larger structures, i.e. the principle of limited independent variation. General-purpose knowledge about 3-D structures and the principles by which they map into 2-D images does not constrain the space of possible scene structures so as to permit the inference that any good interpretation of an image is probably the best one. E.g. it does not rule out the existence of animals combining features in bizarre ways. Without specific knowledge of the world, detection of a zebra's ear would not rule out an animal with a trunk, six legs and two tails. The world would then not be a CFW. (This is like employing frequently useful theorems as well as axioms, to control search for proofs in a theorem-prover.) Our arguments are not relevant to the design of a machine whose visual system will never need to act quickly, which will always have perfect viewing conditions and which will often be transferred to a totally new environment where only the most general and primitive knowledge of 3-D structure, lighting, etc. will be of use to it. Of course, our parallel, schema driven, system will sometimes make mistakes: but people make mistakes and sometimes learn from them. How? Decomposition into sub-systems processing different classes of structures provides opportunities for learning about now rules for linking the different domains, and for inhibiting the invocation of schemas, as well as defining new types of structures in terms of previously known substructures. 8. Problems of incompleteness This theory raises many unanswered questions. Frank O'Gorman has pointed out in an unpublished manuscript that in a pass-oriented system, where each level of analysis is completed before the next begins, incompleteness of information at a certain Location and level has a definite meaning: i.e. it represents the absence of something in the image. We have found it important to distinguish two sorts of incompleteness. It is not too difficult to cope with a gap in a known structure, for instance a hypothesised letter "E", for which the lower "ell" junction has not yet emerged from lower levels. We call this explicit incompleteness: a filler is missing for a slot in a frame. Here there are only two candidate letters "EM or "F", and the word-recogniser can decide which is correct on the basis of other 1XXletters.which XX have emerged -even if they too are ambiguous. This depends on Limited independent variation of letters in the domain of possible words. Implicit incompleteness occurs when trying to link features together to form cues to drive recognition --for instance two previously unattached strokes to form a stroke-junction. Whether such features should be linked often depends on which other features are present nearby. From the absence of neighbours it cannot be decided whether this is because there is no evidence at lower levels, or because processing in that region has not yet finished. In early versions, every level of Popeye[*2] simply used whatever information had already emerged, and then relied on context, or later bottom-up processing, to correct mistakes. Errors were reduced by delaying processing of any one Level until a certain amount of information had been received at that Level, using thresholds determined by image statistics. But even this left the garbage- collection problem of undoing mistakes and their consequences. So higher Levels confronted with this incompleteness were allowed to ensure that everything up to that Level, within a restricted region of the image, had been processed, making use of image-related addressing routes. This caused the focus of attention to jump about, centering on important image features such as junctions between "bars". A better, more psychologically realistic solution, might be to let each level constantly recompute its hypotheses on the basis of the most recent information from other processes. This would be a generalisation of mechanisms using local co-operative processes, like relaxation. It could be very expensive on current computers, and hard to control. 9. Testing the theory experimentally The fact that very young children learn to interpret cartoons and other.. - 'impoverished' pictures so easily seems to support this theory. More detailed studies of what they find easy might be helpful. There is some additional evidence for our claim that higher level processing begins before Lower level analysis is complete. People often think they've recognised a person or object, then spontaneously realise that a mistake has been made, even after the object has passed from view. Informal experiments with messy pictures of overlapping capital letters forming a word suggest that people often see the word before seeing all the Letters. More detailed studies could provide clues as to domains and analyses being processed in parallel, in ordinary vision. Studies of brain damage might indicate which domains of structure (section 3) can be selectively disabled. Useful evidence should come from a study of visual errors. Our theory predicts that even in good visibility, humans and other animals moving rapidly will make more mistakes in an environment containing unfamiliar sorts of objects. (Testing this could be difficult, expensive and dangerous! ) Experiments could test whether increasing familiarity improves performance (of survivors! 1. Different mixtures of familiar and unfamiliar features could be used, to find out if more obvious familiar features lead to errors concerning the other features. Additional experiments would vary Lighting, foggy atmosphere, etc. as well. In poor viewing conditions, our theory mould predict that visual judgements (especially at speed) could be more accurate when the environment contains familiar objects. Comparative studies might show that only some animals with visual systems possess the ability to process a variety of different domains in parallel. FOOTNOTES [*1] Acknowledgments: This work is supported by the U.K. Science Research Council. We have benefitted from discussions with: Geoffrey Hinton, Frank O'Gorman, Steve Draper, Margaret Boden, Max Clowes, Monica Croucher, Steve Hardy, Christopher Longuet-Higgins, David Hogg, Larry Paul, Phil Pettitt, John Rickwood, Robin Stanton and Sylvia Weir, among others. Mike Brady and an anonymous referee made useful comments on a previous version. Judith Dennison helped with production. [*2] Preliminary reports on POPEYE can be found in Sloman and Hardy [19763, Sloman et al. [1978], Birch [1978], and chapter 9 of Sloman [19787. See also Owen [19801. Popeye analyses artificially generated dot pictures representing words made of overlapping cut-out capital letters. It can recognise words whilst much of the lower level processing is incomplete. Details will be reported elsewhere. The 1978 conference paper discusses differences between Popeye and the Hearsay system [Erman and Lesser, Hayes-Roth and Lesser], which have much in common. In particular, both process different domains of structure in parallel, though Popeye eschews the 'blackboard' concept. A similar philosophy has been used in the 'Visions' system (IJCAI-5, pp 642-6471. [*3] Marr makes a similar but different claim in justifying his theory of the 'primal sketch', Ce.g. Marr 19793. He postulates a progression, from image to primal sketch to 2.5D sketch to 3D model, whereas we propose many more domains, processed in parallel. In Popeye, the domains mainly form a hierarchy, but there are two main routes from image data to Better hypotheses and both feed the word recogniser. We suspect that real visual systems require a far more elaborate network of routes through domains. [*4] In Popeye the need for this arises often, e.g. when two parts of a Letter are separated because of occlusion. The two parts can sometimes only be related by using a combination of (a) geometrical relationships and (b) partial recognition of the letter, since there are no image cues for linking, like 'back-to-back' tee junctions. So having recognised what may be, say, an E or an F, the program works out roughly where in the image evidence of a missing bottom stroke might be found, and this constrains searching. [*5] In Popeye, processing can be aborted when the highest level decides it has recognised the depicted word; lower level analysis will often be incomplete. TRUNCATED BIBLIOGRAPHY (Compressed owing to page limit) Becker, J.D. 'The Phrasal Lexicon' T.I.N.L.P. Eds. R.C. Schank and B.L. Nash-Webber. June 1975. Birch F. in Sleeman 1978. Clocksin, W.F., in Sleeman [ed] 1978. Clocksin, W.F. 'A.I. theories of vision', AISB Quarterly 1978 Clowes. M.B. 'Picture syntax' in Kaneff, 19r Clowes; M.B. 'On seeing-things', in 5s Journal vol 2, no. 1 1971. Draper S.W. 'A reply to Clocksin', Quarterly, 1979. Draper, S.W. 'Psychological relevance..' AISB Quarter1 1980 Erman L.D. and V.R. Lesser in IJCAI-4, M.1~9e Gombrich E.H. Art and Illusion Phaidon Press, 1962. Goodman N. Languages of Art, Oxford University Press, 1969 Hayes, P.J., 'The naive physics manifesto', in D.Michie (ed), Expert Systems & the Microelectronic &, Edinburgh Univ. Press, 1979. Hinton G., Relaxation and its role in Vision, Ph.D. thesis, 1977. Hochberg, J and V Brooks, 'Pictorial Recognition as an unlearned ability' Am Jour Psych, 1962. Hochberg, J Perception (2nd Ed) Prentice Hall, 1978. Horn, B. Overview Lecture on vision, in Sleeman [ed] 1978. Kaneff S. Picture Language e Machines~~XX Academic Press, 1970. Marr, D. 'Early processing: of visual information', in Royal Society 1976. Marr, D, Proceedings IJCAI 1979. Minsky, M.L. 'Steps towards artificial intelligence' 1961 Minsky, M.L. 'A framework for representing knowledge', in Winston [1975] Norman D.A. and D.E. Rumelhart Explorations in Cognition, W.H. Freeman 1975 Owen, D.B. 'Intermediate representations in ~POPEYE', this volume 1980. Palmer S.E., in Norman and Rumelhart 1975. Paul, J.L, 'Seeing puppets quickly' Proc AISB Conference, 1976. Radig, B., in Sleeman [ed] 1978. Roberts, L.G. in Tippett et al, Electro-optical Information Processing, 1965 Shirai, Y., in Winston [19751 Sleeman, D. (ed) Proc. AISB/GI Conference., Hamburg 1978 Sloman, A. and S.Hardy, Proc AISB Conference, 1976. Sloman. A, The Computer Revolution in Philosophy, Harvester Press, 1978. Sloman A, and D. Owen, G. Hinton, F. Birch, in Sleeman 1978. Stanton, R.B. 'Plane regions...'. in Kaneff 1970. Winston, P.H. (ed), The psychology of Computer Vision, McGraw-Hill 1975. Woods, W.A. in Proc. IJCAI-5, MIT 1977