Monthly Archives: July 2011

Missing the point in gesture-based interaction

This is a draft of a column I wrote for the ACM’s interactions magazine. It will appear mid 2011.


“Zhège” she said, pointing emphatically at the top right of her iPhone screen.  She leaned further into the gap between the passenger and driver seat of the taxi. Then, lifting her head, she pointed forward through the windscreen in a direction that, I assumed, was where we were hoping soon to be headed.

The taxi driver looked at her quizzically.

Undeterred, she repeated the motion, accompanied by a slower, more carefully enunciated rendition of the word: “zhège”. This time she added a new motion. She pointed at the bottom left of her iPhone screen, at herself, at the taxi driver himself, and then at the ground below us. Balletic though this motion was, it did not reduce the look of confusion on the driver’s face.

Gently taking the device from her hand, he studied the screen. A moment later, his expression changed. He smiled and nodded. He stretched out the index finger on his right hand, pointed to the location on the screen she had isolated, and said “zhège”. He handed the device back to her, flipped on the meter, and grasped the steering wheel. A second later we accelerated out of the taxi rank. He had understood the point of her point(s).

My traveling partner, Shelly, and I know precisely 6 words of Chinese. ‘Zhège’ is one of them. We cannot actually pronounce any of the words we know with any consistency. Sometimes, people nod in understanding. Mostly they don’t. However, the scenario I painted above is how we navigated two weeks in China. The word ‘navigated’ is intentional–it is about the physical and conceptual traversal of options. We navigated space and location. We navigated food. We navigated products. We navigated shopping locations, shopping possibilities and shopping traps (always a concern for tourists, wherever they may be). We did all this navigation speechless; owing to our linguistic ignorance, we accomplished it by pointing. We pointed at menus. We pointed at paper and digital maps. We pointed at applications on our phone screens. We pointed at ourselves. We pointed at desired products. We pointed in space toward unknown distant locations… Basically, we pointed our way to just about all we needed and/or wanted, and we got our way around Beijing with surprisingly few troubles.

Pointing of this kind is a deictic gesture. The Wikipedia definition for ‘deixis’ is the “phenomenon wherein understanding the meaning of certain words and phrases in an utterance requires contextual information. Words are deictic if their semantic meaning is fixed but their denotational meaning varies depending on time and/or place.” [1] In simpler language, if you point and say “this”, what “this” refers to is fixed to the thing at which you are pointing. In the scenario above, it was the location on a map where we wanted to go. Linguists, anthropologists, psychologists and computer scientists have chewed deixis over for decades, examining when the words “this” and “that” are uttered, how they function in effective communication and what happens when misunderstandings occur. In his book Lectures on Deixis, Charles Fillmore describes deixis as “lexical items and grammatical forms which can be interpreted only when the sentences in which they occur are understood as being anchored in some social context, that context defined in such a way as to identify the participants in the communication act, their location in space, and the time during which the communication act is performed”. Stephen Levinson, in his 1983 book Pragmatics, states that deixis is “the single most obvious way in which the relationship between language and context is reflected”.

Pointing does not necessitate an index finger. If conversants are savvy to each other’s body movements–that is, their body ‘language’–it is possible to point with a minute flicker of the eyes. A twitch can be an indicator of where to look for those who are tuned in to the signals. Arguably, the better you know someone, the more likely you will pick up on subtle cues because of well-trodden interactional synchrony. But even with unfamiliar others, where there is no shared culture or shared experience, human beings as a species are surprisingly good at seeing what others are orienting toward, even when the gesture is not as obvious as an index finger jabbing the air. Perhaps it is because we are a fundamentally social species with all the nosiness that entails; we love to observe what others are up to, including what they are turning their attention toward. Try it out sometime: stop in the street and just point. See how many people stop and look in the direction in which you are pointing.

Within the field of human-computer interaction–HCI–much of the research on pointing has been done in the context of remote collaboration and telematics. However, pointing has been grabbing my interest of late as a result of a flurry of recent conversations where it has been suggested that we are on the brink of a gestural revolution in HCI. In human-device/application interaction, deictic pointing establishes the identity and/or location of an object within an application domain. Pointing may be used in conjunction with speech input–but not necessarily. Pointing does not necessarily imply touch, although touch-based gestural interaction is increasingly familiar to us as we swipe, shake, slide, pinch and poke our way around our applications. Pointing can be a touch-less, directive gesture, where what is denoted is determined through use of cameras and/or sensors. Most people’s first exposure to this kind of touch-less gesture-based interaction was when Tom Cruise swatted information around by swiping his arms through space in the 2002 film Minority Report. However, while science fiction interfaces often inspire innovations in technology–it is well worth watching presentations by Nathan Shedroff and Chris Noessel and by Mark Coleran on the relationship between science fiction and the design of non-fiction interfaces, devices and systems–there really wasn’t anything innovative in the 2002 Minority Report cinematic rendition of gesture-based interaction, nor in John Underkoffler’s [1] presentation of the non-fiction version of it, g-speak, in a TED Talk in 2010. Long before this TED talk, Richard Bolt created the “Put that there” system in 1980 (demoed at the CHI conference in 1984). In 1983 Gary Grimes at Bell Laboratories patented the first glove that recognized gestures, the “Digital Data Entry Glove”. Pierre Wellner’s work in the early 1990s explored desktop-based gestural interaction, and Thomas Zimmerman and colleagues used gestures to identify objects in virtual worlds using the VPL DataGlove in the mid 1980s.
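
For the technically minded, the idea that a point "establishes the identity and/or location of an object" can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how g-speak, the Kinect or any real system works: the `Target` class, the `resolve_deictic` function and the 50-unit tolerance are all hypothetical names and values invented for the example. The sketch simply picks whichever known object lies closest to the pointed-at screen location, and gives up if nothing is close enough.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Target:
    """A hypothetical on-screen object that can be pointed at."""
    name: str
    x: float
    y: float


def resolve_deictic(point: Tuple[float, float],
                    targets: Sequence[Target],
                    max_distance: float = 50.0) -> Optional[Target]:
    """Return the target nearest to the pointed location.

    Returns None when no target lies within max_distance, which is the
    code equivalent of the taxi driver's quizzical look.
    """
    px, py = point
    best: Optional[Target] = None
    best_dist = max_distance  # assumed tolerance, in screen units
    for target in targets:
        dist = hypot(target.x - px, target.y - py)
        if dist <= best_dist:
            best, best_dist = target, dist
    return best


# Two made-up places on a made-up map.
targets = [Target("hotel", 40, 300), Target("restaurant", 280, 60)]

# The same word, "this", picks out different things depending on the point.
print(resolve_deictic((275, 70), targets))   # a point near the restaurant
print(resolve_deictic((45, 290), targets))   # a point near the hotel
print(resolve_deictic((160, 180), targets))  # a point near nothing in particular
```

The word "this" never changes; only the coordinates do. Everything hard about real gestural interfaces (sensing where the finger is, deciding whose finger it is, coping with sloppy pointing) is hidden inside the assumption that we already have a clean `point`.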

This is not to undermine the importance of Underkoffler’s demonstration; gesture-based interfaces are now more affordable and more robust than these early laboratory prototypes. Indeed, consumers are experiencing the possibilities every day. Devices like the Nintendo Wii and the Kinect for Xbox 360 system from Microsoft are driving consumer exuberance and enthusiasm for the idea that digital information swatting by arm swinging is around the corner. Anecdotally, an evening stroll around my neighbourhood over a holiday weekend will reveal that a lot of people are spending their evenings jumping around, gesticulating and gesturing wildly at large TV screens, trying to beat their friends at flailing.

There is still much research to be done here, however. The technologies, their usability and also the conceptual design space need exploration. For example, current informal narratives around gesture-based computing regularly suggest that gesture-based interactions are more “natural” than other input methods. But, I wonder, what is “natural”? When I ask people this, I usually get two answers: better for the body and/or simpler to learn and use. One could call these physical and cognitive ergonomics. Frankly, I am not sure I buy either of these yet for the landscape of current technologies. I still feel constrained and find myself repeating micro actions with current gesture-based interfaces. Flicking the wrist to control the Wii does not feel “natural” to me, neither in terms of my body nor in terms of the simulated activity in which I am engaged. Getting the exact motion on any of these systems feels like cognitive work too. We may indeed have species-specific and genetic predispositions to being able to pick up certain movements more easily than others, but that doesn’t make most physical skills “natural” as in “effortless”. Actually, with the exception of lying on my couch gorging on chocolate biscuits, I am not sure anything feels very natural to me. I used to be pretty good at the movements for DDR (Dance Dance Revolution) but I would not claim these movements are in any sense natural, and these skills were hard won with hours of practice. It took hours of stomping in place before stomping felt “natural”. Postures and motions that some of my more nimble friends call “simple” and “natural” require focused concentration for me. “Natural” also sometimes gets used to imply physical skill transfer from one context of execution to another. Not so. Although there is a metaphoric or inspired-by relationship to the ‘real’ physical world counterparts, with the Wii, I know I can win a marimba dancing competition by sitting on the sofa twitching and I can scuba-dive around reefs while lying on the floor more or less motionless, twitching my wrist.

An occupational therapist friend of mine claims that there would be a serious reduction in repetitive strain injuries if we could just get everyone full-body gesturing rather than sitting tapping on keyboards with our heads staring at screens. It made me smile to think about the transformation cube-land offices would undergo if we redesigned them to allow employees to physically engage with digital data through full-body motion. At the same time, it perturbed me that I may have to do a series of yoga sun salutations to find my files or deftly execute a ‘downward facing dog’ pose to send an email. In any case, watching my friends prance around with their Wiis and Kinects gives me pause and makes me think we are still some way away from anything that is not repetitive strain injury inducing; we are, I fear, far from something of which my friend would truly approve.

From a broader social perspective, even the way we gesture is socially prescribed and sanctioned. It’s not just that a gesture needs to be performed well enough for others to recognize it. How you gesture or gesticulate is socially grounded; we learn what are appropriate and inappropriate ways to gesture. Often assessments of other cultures’ ways of gesturing and gesticulating are prime material for asserting moral superiority. Much work was done in the first half of the 20th century on gestural and postural characteristics of different cultural groups. This work was inspired in part by Wilhelm Wundt’s premise in Völkerpsychologie that primordial speech was a gesture and that gesticulation was a mirror to the soul. Much earlier than this research, Erasmus’ bestseller De civilitate morum puerilium [8], published in 1530, included an admonition that translates as “[Do not] shrug or wrygg thy shoulders as we see in many Italians”. Adam Smith compared the English and the French in terms of the plenitude, form and size of their gesturing. “Foreigners observe that there is no nation in the world that uses so little gesticulation in their conversation as the English. A Frenchman, in telling a story that is of no consequence to him or anyone else will use a thousand gestures and contortions of his face, whereas a well-bred Englishman will tell you one wherein his life and fortune are concerned without altering a muscle.” [2]

Less loftily, cultural concerns for the specifics of a point were exemplified recently when I went to Disneyland. Disney docents point with two fingers, not just an outstretched index finger but both the index finger and middle finger. When I asked why, I was informed that in some cultures pointing with a single index finger is considered rude. Curious, I investigated. Sure enough, a (draft) Wikipedia page on etiquette in North America states clearly “Pointing is to be avoided, unless specifically pointing to an object and not a person”. A quick bit of café-based observation suggests people are unaware of this particular gem of everyday etiquette. Possibly apocryphally, I was also told by a friend the other night, when opining on this topic, that in some Native American cultures it is considered appropriate to point with the nose. And, apparently, some cultures prefer lip pointing.

So why bother with this pondering on pointing? I am wondering what research lies ahead as this gestural interface revolution takes hold. What are we as designers and developers going to observe and going to create? What are we going to do to get systems learning with us as we point, gesture, gesticulate and communicate? As humans, we know that getting to know someone often involves a subtle mirroring of posture, the development of an inter-personal choreography of motion–I learn how you move and learn to move as you move, in concert with you, creating a subtle feedback loop of motion that signifies connection and intimacy. Will this happen with our technologies? And how will they manage with multiple masters and mistresses of micro-motion, of physical-emotional choreography? More prosaically, as someone out and about in the world, as digital interactions with walls and floors become commonplace, am I going to be bashed by people pointing? Am I going to be abashed about their way of pointing? Julie Rico and Stephen Brewster of Glasgow University in Scotland have been doing field and survey work on just this, addressing how social setting affects the acceptability of interactional gestures. Just what would people prefer not to do in public when interacting with digital devices, and how much difference does it make if they do or don’t know others who are present? Head nodding and nose tapping apparently are more likely to be unacceptable than wrist rotation and foot tapping [3]. And what happens when augmented reality becomes a reality and meets gestural interaction? I may not even be able to see what you are thumbing your nose at–remembering that to thumb one’s nose at someone is the highest order of rudeness and indeed the cause of many deadly fights in Shakespearean plays–and I may assume, for the lack of a shared referent, that it is in fact me, not the unseen digital interlocutor, at whom the gesture is directed. And finally, will our digital devices also develop subtle sensibilities about how a gesture is performed beyond simply system calibration? Will they ignore us if we are being culturally rude? Or will they accommodate us, just as the poor taxi driver in China did, forgiving us for being linguistically ignorant, and possibly posturally and gesturally ignorant too? I confess: I don’t know if pointing with one index finger is rude in China or not. I didn’t have the spoken or body language to find out.

NOTE: If you are as bewildered as I am by the array of work on gesture-based interaction that has been published, it is useful to have a framework. Happily, one exists. In her PhD thesis Maria Karam [4] elaborated a taxonomy of gestures in the human computer interaction literature, summarized in a working paper written with m.c. schraefel [5]. Drawing on work by Francis Quek from 2002 and earlier work by Alan Wexelblat in the late 1990s, this taxonomy breaks research into different categories of gesture style: gesticulation, manipulations, semaphores, deictic and language gestures.

[1] John Underkoffler was the designer of Minority Report’s interface. The g-speak tracks hand movements and allows users to manipulate 3D objects in space. See also SixthSense, developed by Pranav Mistry at the MIT Media Lab.
[2] For more on this see A Cultural History of Gesture, Jan Bremmer and Herman Roodenburg, Polity Press, 1991.
[3] Rico, J. and Brewster, S.A. Usable Gestures for Mobile Interfaces: Evaluating Social Acceptability. In Proceedings of ACM CHI 2010 (Atlanta, GA, USA), ACM Press.
[4] Karam, M. (2006) A framework for research and design of gesture-based human-computer interactions. PhD thesis, University of Southampton.
[5] Karam, M. and schraefel, m. c. (2005) A Taxonomy of Gestures in Human Computer Interactions. Technical Report ECSTR-IAM05-009, Electronics and Computer Science, University of Southampton.