Natural Language Processing in CALL: How It Works and Why It Doesn't

Frank Berberich, Tokiwa University

 

Introduction

This article is an overview of what natural language processing (nlp) is, how it works, and how it fits into CALL now and might in the future. A principal position of the article is that nlp has a legitimate place in CALL now, but is a long way from fulfilling its apparent promise of implementing a computer teacher that can converse with students naturally. The article focuses upon speech-based language.

What is nlp?

The term "natural language processing," or more conveniently "nlp," is decades old, perhaps coined as a parallel to "data processing" and its successor "information processing." Nlp refers to manipulations of language occurring in nature, such as human language. These manipulations are usually done on computers and often involve simulations of human language behavior. Understandably, nlp has tended to be a major focus of Artificial Intelligence, or AI. (Note: nlp is not NLP in the sense of Neuro-Linguistic Programming, an unrelated area sharing the same initials.)

There are countless examples of nlp systems now in common use, ranging from the simple to the hugely complex. At the simplest level are machines that use oral or written prompts and responses instead of sound signals or lights.

More complex are spell-checkers, dictionaries, concordance systems, synthetic telephone voice messages, "voiceprint" security systems, voice stress analyzers, ticket reservation systems by telephone, dictation systems, grammar checkers, language translators, intelligent database query systems, and so forth.

At very high levels are the AI efforts intended to simulate human language interactions on common topics, or in specific circumstances. These include conversation systems and the expert systems used for various training and advisory activities in many fields. Beyond such systems are the machine-humans found in science fiction.

Broadly speaking, nlp is applied to spoken and written language and can be divided into the functions of recognition and understanding. Some typical auditory-based applications are in Voice Recognition (VR), Speech Recognition (SR), and Speech Understanding (SU, or more generally Language Understanding, "LU"). Similarly specialized systems are used for text processing.

Nlp in CALL

Almost by definition, CALL has included rudimentary nlp since its beginnings, for example in its use of language prompts such as "Press Y or N" or "Choose one". Many packages also offer spoken prompts to which the user responds with keyboard input. Currently, however, several packages offer features that respond to the user's spoken input and some programs have very limited "conversational" repertoires, such as being able to carry forward a greeting dialogue (Berberich, 1997a).

For the sake of completeness, it should be noted that suggestions have been made to impose limitations on the role of nlp in CALL (Kruse, 1998). These ideas are a response to perceived dangers of de-humanization in misleading people into believing that they are having a human interaction with a machine, and the related risks of elevating machines to human status. Perhaps these fears can be addressed in the manner of warning labels on food, by displaying a suitable notice reminding the user that a machine is at the other end of the conversation.

A typical nlp system

The plan of this discussion is to present examples of common language tasks and then show, conceptually, how such tasks can be accomplished with nlp. Implications for CALL are part of the discussion. We look at tasks in VR, SR, and SU.

Fido's three nlp tasks

Consider the dog Fido, who has learned behavior related to human language. We could interpret this behavior to mean that Fido has memories of certain language patterns and these memories are linked to other memories of things to do in response. Here are three examples of Fido nlp:

(1) If Fido barks upon hearing his master's voice, that is voice recognition. Fido reacts to a known voice quality.

(2) If Fido extends a forepaw upon hearing "Shake hands", and taps a hind leg upon hearing "Tap foot", these are examples of speech recognition. Fido reacts to a known speech pattern.

(3) Now, if Fido, upon hearing the command "Shake foot" for the first time, were to extend a hind foot, that would be language understanding. Fido appears to recognize speech patterns and to separate them into what we would call words, to link these in a novel way, and to link the new pattern to a novel behavior.

All three of these examples have a common sequence that we could call the typical nlp cycle. It is:

recognition → understanding → response

First, the speech input is recognized, then in some way linked to an appropriate response, which is then output.

The cognition metaphor

We can say that Fido recognizes things because there is a match between some event (a voice or a command) and a memory of a similar event. In a literal sense of the word, Fido is re-cognizing the event. The use of the term "recognition" suggests a metaphor that neatly fits the electronic storage and retrieval activity involved in nlp. The cognition phase is the activity of entering into the nlp system a database of language elements and links among these elements. Usually, people do this data entry. The recognition phase is the retrieval of these elements (and their associated links) from the database if they match some input.

An example of a very simple database might be the following, where slashes (/ /) mark an utterance, the arrow (→) marks a link, and square brackets ([ ]) mark a physical response:

/Shake hands/ → [extend forepaw]

/Tap foot/ → [wiggle hind leg]

Here, the elements are the commands and actions. With this database, an nlp system could "recognize" just those two phrases and produce the corresponding response. As it happens, such a database is called a set of "production rules"; the rule A → B means that, given A, produce B.

Fido's behavior could be imitated on a computer by displaying an animated figure. For an even simpler example, a text-based conversation database could be:

Hi. How are you? → Fine, thanks, and you?

Thank you → You're welcome.

For the technically-minded reader, this database can be implemented as a simple text file containing a table with just two columns. If the input comes from the keyboard, the system matches this input to an entry in one column and displays on the screen the corresponding entry in the other column. The link between the two elements is simply that they are in the same row of the table. In AI, this technique is called "pattern-matching" and I will use the terms "recognize" and "match" somewhat interchangeably. A simple nlp system like this using Excel is described by Berberich (1997b).
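The two-column table just described can be sketched in a few lines of Python. This is only an illustration, not the Excel system cited above; the names TABLE_TEXT, load_table, and respond are hypothetical, and a tab character is assumed as the column separator.

```python
# Hypothetical two-column table: the input phrase, a tab, then the linked response.
TABLE_TEXT = """Hi. How are you?\tFine, thanks, and you?
Thank you\tYou're welcome.
"""

def load_table(text):
    """Read tab-separated rows into a dictionary mapping input to response."""
    table = {}
    for line in text.splitlines():
        if "\t" in line:
            prompt, reply = line.split("\t", 1)
            table[prompt] = reply
    return table

def respond(table, user_input):
    """Return the entry in the same row as an exactly matching input, or None."""
    return table.get(user_input.strip())

table = load_table(TABLE_TEXT)
print(respond(table, "Thank you"))
```

An input that is not in the first column simply returns None, which is the exact-matching limitation discussed in the next section.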

Feature detection and representation

Using keyboard input for nlp offers two great simplifications. First, the translation from a keypress to an electronic signal is very simple because the key is a fairly simple switch. Second, there is a precise link between each key pressed and the resulting data within the machine, and so a direct correspondence between the input and the data used for processing. The input data is exactly the same as the stored data, and exact matching is possible.

Fido's input does not come from a keyboard, however, but from a human voice. How does Fido recognize such input? It seems unlikely that Fido stores all sounds precisely as they are produced, like a tape recorder, and then matches the input to these precise patterns. To do so would be very wasteful of storage, and, taken to its logical extreme, prevent learning altogether. For, if an exact match were required for recognition, even the small variations that are inherent in individual speech would make such a match impossible. Fido could never learn.

In all three of Fido's tricks, it seems that Fido must be recognizing just certain features of the sounds and acting upon those. Rather than dealing with entire speech patterns, it appears that Fido extracts only some parts of the input. These are then stored in some way and future instances of speech are matched to them.

Fido's response in example (3) in particular confirms this. In order to produce the novel behavior of (3), Fido must detect and recognize differences in the two similar commands—the first part of both commands has the same features that we would think of as "shake", but the second parts are different. Having done this, Fido must recognize "shake" and "foot" as separate elements. (The final step of synthesizing the newly recognized elements is discussed below.)

More generally, these considerations apply to our ability to understand speakers with vastly different vocal qualities and accents, and in noisy environments. The widely accepted view is that we store data in a form quite different from the originating input, having first extracted certain significant features from it. We detect important patterns in the input and internally represent these in some vastly simplified form.

Feature detection in speech

The ear first filters and transforms (or, more formally, "transduces") the rapid variations of air pressure caused by the sound into nerve impulses. Several layers of the auditory nerve system extract significant patterns from this stream of impulses, and present these to the brain for storage. The extraction process is called feature detection and the transformation into another form is called representation or encoding.

In computers, a speech pattern is transformed into a rapidly varying electrical signal by an audio system and this signal is converted to a stream of numbers by a digital processor. Features of this stream can be extracted and stored as a collection of electrical states that correspond quite closely to symbols, most of them numeric.

Typical speech features correspond to the phonetic elements of voice and language. More abstractly, a significant feature is something that distinguishes one type of element from another. Such features are illustrated in contrastive pronunciation pairs, where all but the distinguishing feature is the same.

The same approach can be applied to types of input other than speech. In OCR, for example, systems typically detect such features as points, edges, closed and open loops, line crossings, free-floating and supported elements, high, middle and low placement, etc. Each of these features is encoded as a number and a group of such numbers represents a specific letter.

For example, if five numbers are used to represent any letter, and if the number 1 represents a single closed loop, 00001 might be a representation of the letter "o", while a "p" might be encoded as: left line/upper right partially supported closed loop—say, arbitrarily, 13270.

Speech representation

The speech signal can be encoded as numbers that express the strength of the air pressure at any moment, and these numbers stored as a data file. For an idea of how much data this can be, consider that, in order to faithfully store music, current audio CD systems divide time into about 44,000 moments per second and encode each of these moments as one of many thousands of increments of pressure value.
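To make the scale concrete, the arithmetic can be worked through for the standard audio CD format, which samples 44,100 times per second at 16 bits per sample in two channels. The constant names below are illustrative only.

```python
SAMPLES_PER_SECOND = 44_100   # standard audio CD sampling rate
BITS_PER_SAMPLE = 16          # each sample is one of 2**16 pressure levels
CHANNELS = 2                  # stereo

levels = 2 ** BITS_PER_SAMPLE
bytes_per_second = SAMPLES_PER_SECOND * (BITS_PER_SAMPLE // 8) * CHANNELS
bytes_per_minute = bytes_per_second * 60

print(levels)            # 65536
print(bytes_per_second)  # 176400
print(bytes_per_minute)  # 10584000
```

Roughly ten megabytes per minute of raw sound is far more than a feature-based representation of speech needs.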

Fortunately, we seem to need far less data to represent speech features than we do for the complete range of music sound. For example, there are only four frequency regions, called "formants", commonly recognized as critical to understanding human speech (Brown and Deffenbacher, 1979). These formants are particular frequency areas within the total sound that are strongly affected by changes in the vocal tract, such as the position of the tongue. Other frequency characteristics can be extracted to help refine the representation of the voice quality and the speech pattern and, of course, word boundaries. One current SR system uses about 20 features for its representation of speech (Hunt, 1996).

As might be expected, defining appropriate features to detect and designing their representation is a major part of the theoretical and technical challenge of SR. While formants are fairly intuitively obvious features to select for detection and representation, some others are not. These might include such features as the rates of change of frequencies due to common prosodic patterns in a language, and the density of certain frequency areas over time.

Theory-free modeling

A powerful technique for defining significant language features is to simply subject a large corpus of natural language to statistical analysis and look for clear patterns. For example, simple counting can reveal the frequency of occurrence of sounds and words—useful data for organizing a database. Other patterns might very accurately correspond to phonetic features and be useful for word recognition. No explanation of these features, nor theory of the language, is needed.
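The "simple counting" mentioned here can be sketched as follows. The corpus is a toy sentence invented for illustration, and the function names are hypothetical; a real study would use a large collection of transcribed speech or text.

```python
from collections import Counter
import re

# Hypothetical toy corpus, standing in for a large body of natural language.
corpus = "the dog saw the cat and the cat saw the dog"

def word_frequencies(text):
    """Count how often each word occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def pair_frequencies(text):
    """Count how often each pair of adjacent words occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

print(word_frequencies(corpus).most_common(1))   # [('the', 4)]
print(pair_frequencies(corpus)[("the", "cat")])  # 2
```

No theory of the language is involved: the counts alone suggest which items should sit near the front of a database.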

In computers, the processing load to perform this detection and representation is great enough that special processors, called "digital signal processors", or DSPs, are incorporated into, or available as add-in sound boards for, desktop computers.

Speech recognition

While Fido's task is to perform some action upon hearing specific sounds, the task of an SR system is to produce a text file from a continuous speech input. So far, Fido has had to recognize only a few words. In most human speech, however, many thousands of words are in constant use and potentially millions more might occasionally appear. The SR process is similar in both cases, but the magnitude of the problem is vastly different and some new strategies are needed to attack it.

An indication of the processing power needed to recognize continuous, fairly normal speech is that such systems have only become viable commercial products in the past two or three years. The basic process remains pattern-matching, but on a grand scale. The SR system must continuously detect a sufficient number of features to enable it to search a very large database to match a pattern to a corresponding word.

Vastly simplified, an SR database might look something like this (where the first column is the numerical representation of a word):

00000 → (nil)

00001 → a

00002 → the

00003 → there

36012 → recognize

In this example, the database is arranged in order of the numerical representation and with more common words near the beginning. Clearly, if the database had no internal arrangement (no "structure"), it would be necessary, on average, to search about half of the database for each new word in the input. However, nlp databases usually have a great deal of structure and there are many clever ways to search such structures quickly. The entire collection of possible patterns is called the "search space" and a key strategy in nlp generally is to reduce this space as quickly as possible.
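The benefit of structure can be illustrated with the five entries of the table above. The sketch below contrasts an entry-by-entry search with a binary search over the sorted codes; binary search is only one of the "clever ways" alluded to, chosen here for brevity, and the names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical database, sorted by numeric code as in the table above.
CODES = [0, 1, 2, 3, 36012]
WORDS = ["(nil)", "a", "the", "there", "recognize"]

def linear_lookup(code):
    """Unstructured search: examine entries one by one, counting the steps."""
    steps = 0
    for c, w in zip(CODES, WORDS):
        steps += 1
        if c == code:
            return w, steps
    return None, steps

def sorted_lookup(code):
    """Structured search: binary search, halving the search space each step."""
    i = bisect_left(CODES, code)
    if i < len(CODES) and CODES[i] == code:
        return WORDS[i]
    return None

print(linear_lookup(36012))  # ('recognize', 5)
print(sorted_lookup(36012))  # recognize
```

With five entries the difference is trivial, but with tens of thousands of words each halving of the search space matters.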

As suggested above, one database structuring approach would be to arrange it in order of the frequency of occurrence of various sounds, sound pairs, triplets, etc., in the language. In written English, for example, which is only loosely related to spoken words, the most frequently occurring letter is "e", followed by "t", "a", etc., and the most frequently occurring pairs are "th", "he", etc. Each layer of organization reduces the search space greatly because all options above that layer may be eliminated.

Search strategies

The basic goal of a search strategy is to reduce the search space as quickly as possible to a single likely candidate. A reasonable SR search strategy might be: match using only the first two or three features of the input pattern (an educated guess, as it were) and then confirm that the remaining incoming features continue to match the "guess". If they don't, back up to the feature that does match and try the next likely option.

For example, if the features suggest the sound /th/ then a plausible guess might be "the" and this guess might be confirmed by the succeeding features; if the succeeding features suggested /tha/ then "that" would be a reasonable next guess, and so forth. This process continues until the guess is confirmed to some predetermined level of certainty whereupon it is stored and displayed, or disconfirmed, and the system beeps or inserts something to signal an unknown.
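A much-simplified sketch of this guess-and-confirm idea follows. It assumes that letters stand in for acoustic features and that the lexicon is listed from most to least frequent; both are assumptions made for illustration. Rather than backing up explicitly, it keeps every candidate still consistent with the input and reports the most frequent one.

```python
# Hypothetical lexicon, listed from most to least frequent.
LEXICON = ["the", "that", "there", "think", "thank"]

def recognize(features):
    """Narrow the candidate words as each feature (here, a letter) arrives."""
    candidates = list(LEXICON)
    heard = ""
    for feature in features:
        heard += feature
        candidates = [w for w in candidates if w.startswith(heard)]
        if not candidates:
            return None  # disconfirmed: signal an unknown word
    # Prefer an exact match; otherwise report the most frequent remaining guess.
    return heard if heard in candidates else candidates[0]

print(recognize("the"))  # the
print(recognize("tha"))  # that
print(recognize("thx"))  # None
```

Each incoming feature shrinks the search space, which is the point of the strategy.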

Notice that an exact match is not necessary. It may be that just a few features are enough to distinguish certain words in certain contexts. As it happens, most programming languages offer a similar relaxation of matching standards even in pure text processing. For example, many word processors allow the use of "wildcard" characters when searching for a word. One might look for all five-letter words that begin with a "th" and end with a "k" by searching for "th**k". Using this technique, one can allow for certain kinds of typing mistakes in user input.
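As an illustration of wildcard matching, Python's standard fnmatch module can be used. Note that its conventions differ from the word-processor example: here "?" matches exactly one character and "*" matches any number of characters, so the five-letter search is written "th??k". The word list is invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical word list to search.
WORDS = ["think", "thank", "thick", "thanks", "trunk", "thunk"]

def wildcard_search(pattern, words):
    """Return the words matching a pattern in which '?' is any single character."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(wildcard_search("th??k", WORDS))  # ['think', 'thank', 'thick', 'thunk']
```

A user who types "thunk" for "think" could still be matched to the intended word by a pattern of this kind.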

Does Fido do SR this way? Do we? Likely not. Typical SR systems depend almost entirely on just the input speech patterns and the phonetics of the language alone, and make very little use of higher level language structure, such as syntax, grammar, semantics, or meaning. In contrast, we seem to do SR using all of these levels at once. One demonstration of this is the well-known "cocktail party effect". In a crowded room with many people talking among themselves, we can attend to specific conversations by following their content even though the noise may obscure many of the words. In short, machine SR uses mostly the sound context alone for processing, while natural SR processes on many levels at once.

Speech recognition in CALL

As mentioned before, several commercial CALL packages offer SR in some form. These use a drastically simplified database and search strategy in that they reduce the SR task to recognizing any of a few tens of pre-determined sentences. For input, the user is asked to produce items from a displayed list of these options. Such systems are thus, properly speaking, sentence recognition systems.

Simplified as they are, at present these systems quite often fail to recognize even native-speaker productions. They also have great trouble with variations in voice quality and accent, and some require the user's voice type (for example, male, female, or child) to be entered before using the system. Full dictation systems usually need a few minutes of training to sharpen recognition of each user's speech patterns. Given the rapid increase in processing power of desktops, more flexible and accurate SR should be a reality within the next decade.

How about using an SR dictation package for pronunciation practice? Again, the technology is lacking, for now. A recent conference presentation (McCarthy, 1999) included a video segment of a very patient Japanese learner struggling to get a successful dictation of a straightforward sentence. Again, future developments are sure to bring this application into practicality.

Language understanding

SU systems are usually simply text-based systems with an SR front-end. Once the SR system converts the input to text, it can be processed as a text file, so the more general term LU is suitable. Much of what is called LU is produced by a number of simple tricks, such as matching certain cues in the input or performing simple formal transformations on it.

Taking examples from an Eliza-type program (a psychotherapy session simulation), if the input contains a family word such as "mother" or "wife", the system could produce, "Tell me more about your family." One can form questions from certain statements by deleting parts of the input and adding some words or phrases. Thus,

I feel down today → Do you often feel down?

Men are all alike → Why do you think men are all alike?

I'm really excited → Are you often really excited?
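These tricks can be sketched in Python as a short chain of cue checks and text transformations. This is not the code of any actual Eliza implementation; the function name, the rules, and the family words beyond "mother" and "wife" are assumptions made for illustration.

```python
import re

# "mother" and "wife" come from the text; the other cue words are assumed.
FAMILY_WORDS = {"mother", "father", "wife", "husband", "sister", "brother"}

def eliza_reply(text):
    """Produce a reply by cue matching, then by simple formal transformations."""
    cleaned = text.strip().rstrip(".!?")
    if not cleaned:
        return "Please continue."
    words = set(re.findall(r"[a-z']+", cleaned.lower()))
    if words & FAMILY_WORDS:
        return "Tell me more about your family."
    m = re.match(r"I feel (\w+)", cleaned, re.IGNORECASE)
    if m:
        return f"Do you often feel {m.group(1)}?"
    m = re.match(r"I'm (.+)", cleaned, re.IGNORECASE)
    if m:
        return f"Are you often {m.group(1)}?"
    # Fallback: wrap the statement in a question, keeping a leading "I" capitalized.
    body = cleaned if cleaned.startswith("I ") else cleaned[0].lower() + cleaned[1:]
    return f"Why do you think {body}?"

print(eliza_reply("I feel down today"))  # Do you often feel down?
print(eliza_reply("Men are all alike"))  # Why do you think men are all alike?
print(eliza_reply("I'm really excited")) # Are you often really excited?
```

Nothing in the program represents what "down" or "excited" means; the replies come entirely from rearranging the input.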

On the surface, these seem like fairly reasonable responses, and perhaps even people sometimes interact fairly automatically this way. Once the mechanism is understood, however, it seems hardly worth calling AI or LU.

Most would agree that Fido's "footshake" is an intelligent response. Assuming that it wasn't simply an arbitrary reaction to confusion, it is an example of a novel, but reasonable, response to an unexpected input. This kind of behavior is certainly not what we expect of machines, almost by definition, and we might be comfortable calling it an example of understanding.

As it happens, however, computers can easily manifest virtually random behavior, as we can see from various game programs and electronic game machines. Most programming languages include a random number function, and one simply makes a numbered list of responses and then repeatedly uses the random number function to generate numbers that arbitrarily select the corresponding elements in the list. Making the response interesting or clever is a different, and more difficult, problem.

In "Eliza", for example, when the user's input can't be matched, a neutral response such as "Hmmm…", or "Please continue", is produced. (Hardly a strategy exclusive to machines; good L2 speakers in particular become quite skilled at producing neutral responses to buy time while trying to deal with some difficult input.) It would be boring and unrealistic if the computer always replied with the same neutral response, so a list of several is used for random selection. One of my own versions of Eliza always gets a laugh when it pops up with, "Have you tried chicken soup?" Of course, the content of the response was my own attempt at humor, and it is non-specific enough to work anytime.
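The numbered-list technique can be sketched with Python's random module. The three responses are those quoted above; the function name and the optional rng parameter are hypothetical conveniences for this example.

```python
import random

# A numbered (indexed) list of neutral responses, taken from the text.
NEUTRAL_RESPONSES = ["Hmmm...", "Please continue.", "Have you tried chicken soup?"]

def neutral_response(rng=random):
    """Pick a response by generating a random index into the list."""
    index = rng.randrange(len(NEUTRAL_RESPONSES))
    return NEUTRAL_RESPONSES[index]

print(neutral_response())
```

The selection is arbitrary; any humor lies in what the programmer put into the list.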

Unexpected and interesting

Perhaps one of the most "human" aspects of human communication is our ability to produce both unexpected and interesting language. We often enjoy turning a formulaic exchange into an interesting conversation, such as in the old joke,

Say "Hello" to Mom.

Hello to Mom.

Here, the response is both unexpected and, perhaps, clever. It is a deliberate misinterpretation of the request that becomes a joke.

Could Fido produce the clever footshake using only pattern-matching? We saw earlier that Fido must recognize that /shake/ is common to both commands but that /hands/ and /foot/ are distinct. To do this, Fido must match the first part of the command to "shake" and the second to "foot". Some meta-pattern is needed that suggests that parts of a command can have their own meaning and be recombined with others. This is grammar, which most would doubt that Fido has:

/shake foot/ → (/shake/ + /foot/)

from which would follow naturally,

/shake/ → [extend]

/foot/ → [hind leg]

and need only a new link that does something like apply the first behavior to the second body part:

/shake X/ → shake (applied to) X.
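The decomposition above can be sketched as two small tables plus one combining rule. The tables reuse the actions and body parts from the earlier Fido database; the names ACTIONS, BODY_PARTS, and interpret are hypothetical.

```python
# Separate tables for the two parts of a command, following the rules above.
ACTIONS = {"shake": "extend", "tap": "wiggle"}
BODY_PARTS = {"hands": "forepaw", "foot": "hind leg"}

def interpret(command):
    """Split a two-word command and apply the action to the body part."""
    words = command.lower().split()
    if len(words) != 2:
        return None
    verb, obj = words
    if verb in ACTIONS and obj in BODY_PARTS:
        return f"[{ACTIONS[verb]} {BODY_PARTS[obj]}]"
    return None

print(interpret("Shake hands"))  # [extend forepaw]
print(interpret("Shake foot"))   # [extend hind leg]
```

"Shake foot" was never entered as a whole command, yet the combining rule produces a sensible response to it, which is the novel behavior in Fido's third task.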

In AI parlance, this new kind of link would be an example of a "semantic link": an abstraction expressing that there are things that can be done to something else, and that "shake" is one of them. This is a correspondence similar to what we would call "meaning". It is easy to imagine a hierarchy of such abstractions within the database, linking elements into groups, and these groups into larger groups, and so forth. This kind of database becomes a representation of the language: its vocabulary, syntax, grammar, and semantics. Among these, representations of what constitutes "cleverness" are needed, such as potential plays on words, use of idioms, current events, etc.

In principle, however, no matter how complex the database, the system is still simply looking for patterns to match and respond to. The only thing lacking in such a system is personal experience to give it "real understanding".

As with the simple matching database, meta-links that express relations among elements of the database are usually entered by people, not created by the system. There are, however, systems that "learn" by employing such relationships to organize new input automatically. That is, they have patterns that can be matched with input to determine how elements of the input should be linked into the database. For example, if a new word entry includes information that it is a verb, and if the database includes a pattern that recognizes this information, then it can be linked to all patterns that involve verbs.

LU databases can, of course, be very large. If they are then to be combined with an SR database to create an SU system, both the data and the search space become huge. Representations of a reasonably complete vocabulary with word definitions and the language itself must be included, as well as data about the world, if the system is to have an ordinary conversation. Such systems have been under development for many years and continue to improve, but convincing examples still require industrial-strength computing power far beyond that of a desktop.

Speech understanding in CALL

As with SR in CALL, a practical approach to reducing the database and search space is to restrict the number of reasonable responses, both from the user and the system. In the case of SR, the number of utterances available to the user is at most 20 or so, and the system need only search for reasonable matches among those few.

In SU, similarly, a highly formulaic conversation domain reduces the vocabulary and interaction options drastically. For example, a restaurant food-ordering scenario is likely to include only a greeting, and some discussion of the menu items and their preparation, and a closing gambit. It is easy to write sufficient branching options into a program for this kind of conversation. Insofar as one believes in learning through practicing functional dialogues, such programs might be quite useful.
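A branching dialogue of this kind can be sketched as a small table of states. The script below is invented for illustration (the prompts, menu items, and names SCRIPT and run_dialogue are all hypothetical), and typed input stands in for the output of an SR front-end.

```python
# Hypothetical script: state -> (system prompt, {expected input: next state}).
SCRIPT = {
    "greeting": ("Good evening. Are you ready to order?",
                 {"yes": "order", "no": "greeting"}),
    "order": ("What would you like?",
              {"steak": "preparation", "salad": "closing"}),
    "preparation": ("How would you like it cooked?",
                    {"rare": "closing", "medium": "closing", "well done": "closing"}),
    "closing": ("Thank you. Your order will be ready soon.", {}),
}

def run_dialogue(inputs, start="greeting"):
    """Follow the branches chosen by the user's inputs; return final state and transcript."""
    state = start
    transcript = [SCRIPT[state][0]]
    for user_input in inputs:
        options = SCRIPT[state][1]
        next_state = options.get(user_input.strip().lower())
        if next_state is None:
            # Input outside the small set of expected responses: ask again.
            transcript.append("Sorry, could you say that again?")
            continue
        state = next_state
        transcript.append(SCRIPT[state][0])
    return state, transcript

state, transcript = run_dialogue(["yes", "steak", "medium"])
print(state)  # closing
```

Because each state accepts only a handful of responses, the search space at any moment is tiny.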

There are many other modest LU applications for CALL. For example, the word processor I am using to write this article offers a continuous error-correction feature which highlights many possible problems such as misspellings, possible grammar errors and confusing sentences. It automatically corrects certain misspellings such as "teh" or "adn" but lets me leave them, as I have here, if I type them twice the same way. As processing power and CALL programming sophistication increase, it is likely more such helpers and tutors will emerge.

Some philosophy

The reader may have noticed throughout this article a certain degree of interchangeability in terminology, whether referring to living or machine nlp. This is intentional. One can explain a great deal with some very simple operations such as database organization and pattern-matching. We saw in the Fido examples that "understanding" needed only a hierarchically organized database in order to represent relations among the data elements beyond a simple correspondence. One might speculate that such hierarchies are what differentiate us from, say, Fido. Since we can do hierarchical databases with machines also, then we might well ask, what differentiates us from them?

One difference, of course, is direct experience of the world. Much of our thought and cleverness comes from the richness of experience and the ability to apply that to novel situations. Experience data can be programmed, but it is still certainly second-hand. At most, a machine might have something to say about power outages, overheating, or jammed floppy disks. Another more fundamental difference is in how computers and brains work. The current model for computers involves static data storage, where data sits, unchanging, in storage and must be moved from it to a special place for processing.

In sharp contrast is our brain, which uses dynamic storage, as Schank calls it (Schank & Cleary, 1994). (I presented a similar notion about 15 years ago, later published under the rubric of the "thinking library"; Berberich, 1994.) The concept is simply that data in living brains is continuously organizing itself with other data, forming new links and hierarchies. Data in brains is stored as circulating patterns of nerve impulses that are facilitated into long-term memory by changes in brain-cell structure. As these patterns flow, they can combine with others and form new patterns that we would call new ideas. This is taking place among roughly a hundred billion cells, each with thousands of connections to other cells. Certainly, parallel computing on a colossal scale!

Thus, it seems reasonable to expect that we are a long way from conversational CALL. We are a long way, too, from the day suggested at a recent CALL conference when, in contrast to now, when language learners pay extra to use computers, people will pay extra to interact with a human (Lewis, 1999). There are, however, many intermediate levels that can be explored, and these are well within the capabilities of desktops and amateur programmers.

References

Berberich, F. (1997a). Partner—A computer simulation for dialogue pair practice. Gaikokugo kyoiku ronsyu (Studies in Foreign Language Teaching), University of Tsukuba Foreign Language Center, 16, 49-58. Available http://www.tokiwa.ac.jp/~frank

Berberich, F. (1997b). Conversation in CALL: some theory and technique. C@lling Japan, 6 (4), 10-13. Available http://www.tokiwa.ac.jp/~frank

Berberich, F. (1994). Kangaeru toshokan (The thinking library). Bulletin of the University of Library and Information Science. Available http://www.tokiwa.ac.jp/~frank

Brown, E. L. & Deffenbacher, K. (1979). Perception. New York: Oxford University Press.

Hunt, M. J. (1996). Signal representation. Survey of the state of the art in human language processing. Retrieved August 5, 1998 from: http://www.cse.ogi.edu/CSLU/HLTsurveych1node5.html#SECTION13human

Kruse, M. (1998). New wine in old bottles: Is there a future for ELIZA after all? In P. Lewis (Ed.), Teachers, learners, and computers: Exploring relationships in CALL (pp. 205-214). Nagoya: The Japan Association for Language Teaching Computer-Assisted Language Learning National Special Interest Group.

(Note: A collection of implementations of ELIZA can be found at http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/classics/eliza/0.html; accessed 9/29/99)

Lewis, P. & McCarthy, K. (1999, May). Natural Language Processing in CALL. Discussion conducted at the CALLing Asia conference, Kyoto, Japan.

Schank, R., & Cleary, C. (1994). An experiment in memory and knowledge. Engines for Education. Retrieved August 5, 1998 from http://www.ils.nwu.edu/~e_for_e/nodes/NODE-4-pg.html