A. Neri, C. Cucchiarini, H. Strik, L. Boves
A²RT, Department of Language and Speech
University of Nijmegen, The Netherlands
{A.Neri, C.Cucchiarini, H.Strik, L.Boves}@let.kun.nl
ABSTRACT: In this paper, we examine the relationship between pedagogy and technology in Computer
Assisted Pronunciation Training (CAPT) courseware. First, we will analyse available literature on second
language pronunciation teaching and learning in order to derive some general guidelines for effective training.
Second, we will present an appraisal of various CAPT systems with a view to establishing whether they meet
pedagogical requirements. In this respect, we will show that many commercial systems tend to prefer
technological novelties to the detriment of pedagogical criteria that could benefit the learner more. While
examining the limitations of today’s technology, we will consider possible ways to deal with these shortcomings.
Finally, we will combine the information thus gathered to suggest some recommendations for future CAPT.
1. Introduction
The advantages that Computer Assisted Language Learning (CALL) can offer are nowadays well-known to
educators struggling with traditional language classroom constraints. Computer Assisted Pronunciation Training
(CAPT), in particular, can be beneficial to second language learning as it provides a private, stress-free
environment in which students can access virtually unlimited input, practise at their own pace and, through the
integration of Automatic Speech Recognition (ASR), receive individualized, instantaneous feedback. It is not
surprising, then, that a wealth of CAPT systems have been developed, many of which are available on the market
for the language teacher or the individual learner.
When examined carefully, however, the display of products may not look entirely satisfactory. Many
authors describe commercially available programs as fancy-looking systems that may at first impress student and
teacher alike, but eventually fail to meet sound pedagogical requirements (Watts 1997, Murray & Barnes 1998,
Price 1998, Warschauer & Healey 1998, Pennington 1999). These systems, which do not fully exploit the
potentialities of CAPT, look more like the result of a technology push, rather than of a demand pull. This may
not necessarily be due to a lack of willingness, on the part of the developers, to include pedagogical guidelines in
the design. It may simply be due to a failure to adopt a multidisciplinary approach involving speech
technologists, linguists and language teachers (Cole et al. 1998, Price 1998), or more fundamentally, to the
absence of clear pedagogical guidelines that suit these types of environments.
What are, then, the guidelines that should be considered when developing a pedagogically sound CAPT
system? We believe that research on second language acquisition and teaching can already provide us with some
indications on which ingredients are needed for effective pronunciation training. Although much work still needs
to be done, especially with respect to the issue of feedback, we feel that it is possible to suggest ways to blend
these ingredients in order to obtain the optimal outcome. However, incorporating this knowledge within state-ofthe-
art technology may not be as straightforward as educators hope. Current ASR technology, for instance, still
suffers from several limitations that pose constraints on the design of CAPT, as is exemplified by the occasional
provision of erroneous feedback.
In this paper, we will first analyse available literature on traditional pronunciation training in order to
identify the basic pedagogical criteria that a system should ideally meet. Second, we will provide a critical
evaluation of those CAPT systems that more closely fulfill those demands, with a view to establishing which
pedagogical aims can be achieved with state-of-the-art technology. In doing so, we will focus in particular on
the issue of feedback. Finally, we will combine the information thus gathered in an attempt to provide some
recommendations for the development of CAPT systems that employ state-of-the-art technology in order to
meet pedagogical requirements.
2. Second language pronunciation training: ideal pedagogical requirements
According to many researchers, the biggest problem in CALL is a lack of guidelines from the second language
acquisition research field that could be used to better employ the enormous progress recently made in
technology, to design better courseware (Levy 1997, Pennington 1999, Chapelle 1997, 2001). Of all CALL tools,
2
this problem particularly affects CAPT systems (Pennington 1999). Although valuable criteria have been
outlined in the past few years to evaluate CALL, these are either of a general nature (as in Chapelle 2001) or
they mainly concern computer assisted learning of vocabulary or grammar, while pronunciation is hardly
mentioned. This scarcity of indications makes it hard for CAPT practitioners to develop effective courseware.
Similarly, within traditional teaching environments, conscientious teachers trying to devise optimal
pronunciation training for their students are faced with many questions.
Why has research as yet been unable to offer a straightforward answer to these questions? First of all, the
considerable variety in teaching contexts and learning aims makes it difficult to set hard-and-fast rules that can
be applied across different learning settings. Besides, until recently, many educators were convinced that
teaching pronunciation was pointless because accent-free pronunciation of the second language (L2) was
considered a myth (Scovel 1988) and because training would either have no impact or, even worse, would hinder
the natural, unconscious process needed for the acquisition of pronunciation (Krashen 1981, Krashen & Terrell
1983). This view led to a general tendency to neglect pronunciation in favour of grammar and vocabulary in
research on second language acquisition. As a result, little information is available on how pronunciation can
best be taught.
Some of these beliefs have been contradicted by recent studies indicating that tailor-made training can
improve a learner’s pronunciation in the L2 to such a degree that – to human judges - s/he can sound
indistinguishable from a native speaker (Bongaerts 1999). Other studies have evidenced a general intolerance for
strong foreign accents that might place learners in the L2 country at a professional or social disadvantage
(Brennan & Brennan 1981, Morley 1991). Furthermore, the number of professionals who regularly communicate
in a foreign language for their work has increased with globalisation. In order to ensure that these learners are
able to efficiently communicate in the L2, it is imperative that language teaching methods include pronunciation
training.
With respect to the ultimate goal of pronunciation training, many researchers now agree that, while
eradicating the finest traces of foreign accent might only be necessary for the training of future spies
(Abercrombie 1991), a reasonably intelligible pronunciation is an essential component of communicative
competence (Abercrombie 1991, Morley 1991, Munro & Derwing 1995, Celce-Murcia, Brinton, & Goodwin
1996). In this respect, it is important to draw a distinction between intelligibility or comprehensibility on the one
hand, and accentedness of L2 speech on the other hand. These are two different, albeit related, dimensions of
non-native pronunciation. A strong foreign accent does not always hinder intelligibility of speech and specific
types of instruction do not necessarily lead to improvement of both these aspects (Derwing & Munro 1997). The
importance of “comprehensibility” over “correct pronunciation” also emerged from a recent study on user
requirements that was carried out within the framework of the European project ISLE, which aimed at
developing an automatic pronunciation training system for Italian and German learners of English (ISLE 1.4
1999). Many of the studies carried out so far have not drawn that distinction, thus obtaining blurred results that
make it difficult to make comparisons and draw significant, generalizable conclusions.
A close examination of recent research can nevertheless help to identify some of the factors that affect L2
pronunciation most severely and to derive general guidelines for the teaching of pronunciation. Various studies
have revealed that pronunciation learning is affected by a number of variables such as first language (L1), level
of education, age on arrival (for naturalistic settings), amount of use of L1 and L2, motivation for learning L2,
etc. (see Piske, MacKay & Flege 2001 for an overview). These are all factors that can vary from person to person
and that cannot be controlled directly by the teacher to produce the desired learning outcomes. However, there
are other variables that are also known to affect pronunciation learning and that can be manipulated so as to
obtain better results. These are input, output and feedback. These factors will be analysed in more detail in the
following three sections.
2.1. Input
According to interactionist theories, the basic ingredient for successful language acquisition is input. Students
must be able to access large quantities of input, so that target models become available. Although the majority of
the studies on the impact of different types of input have addressed the acquisition of linguistic aspects other
than pronunciation (see Schachter 1998), there are reasons to believe that input can benefit pronunciation
learning. As pointed out by Leather and James (1996), the initial production of new speech patterns, whether in
L1 or L2, implies some phonetic representation in auditory-perceptual space that must have been previously
derived from exemplars available in the community or explicitly presented during training. Just like for the
acquisition of L1 sounds, multiple-talker models seem to be particularly effective to improve perception of novel
contrasts as the inherent variability allows for induction of general phonetic categories (Logan et al. 1991). To
this end, it may be important that lip movements be visible for the students, as both seeing and hearing a sound
that is being articulated has been shown to improve production and perception (Massaro 1987, Jones 1997).
3
It has also been suggested that specific instruction on different pronunciation aspects can lead to
improvement of those aspects (Derwing, Munro & Wiebe 1998, Flege 1999, Bongaerts 2001). This may be taken
as an indication that metalinguistic awareness is conducive to learning gains in pronunciation. With regard to the
way input should be presented, teachers should try to contextualize input, as meaningful learning, i.e. learning
through associations, generally facilitates long-term retention (Ausubel 1968). Furthermore, input that is
meaningful to a learner is perceived by the learner as relevant to his/her needs, a factor that can stimulate
intrinsic motivation and thus indirectly favour learning (Dörnyei 1998, Keller 1983). Another way to stimulate
learner motivation is to present the student with engaging input that also accommodates different learning styles
(Crookes & Smith 1991, Oxford & Anderson 1995). For instance, input could be presented in written, aural and
audio-visual form (eg. a radio interview and a short film episode).
2.2. Output
Although essential, mere exposure to the L2 does not appear to be a sufficient condition for pronunciation
improvement, as is exemplified by long-term foreign residents who retain a strong accent and are hardly
intelligible in the L2 (Morley 1991). As a matter of fact, it is now generally accepted in second language
acquisition research that, if the learners’ aim is to speak the foreign language fluently and accurately, it is
necessary for them to practise speaking it (Swain 1985, Swain & Lapkin 1995, Kendrik 1997). By producing
speech, learners can test their hypotheses on the L2 sounds. Learners can compare their own output with the
input model and consequently form correct L2 representations. Through production, speakers receive a first,
proprioceptive feedback on their own performance: auditory and tactile feedback is available from air- and boneconducted
pressure changes and from contact surfaces of articulators, while feedback from the joints, tendons,
and muscles provides a sense of articulatory positions and movements; motor programs are then gradually
adjusted until a satisfactory match is made between feedback signals and target model (De Bot 1983, Leather &
James 1996). Furthermore, through practice, knowledge about the L2 already internalised can become more
automatic and thus enhance fluency (de Bot 1996). Output also allows elicitation of more input and feedback
from peers (Swain & Lapkin 1995) and engages self-monitoring skills – aspects that appear to be linked to good
L2 performance (Rubin 1987, O’Malley and Chamot 1990, Wharton 2000).
However, the activities aimed at developing the students’ productive skills should be designed carefully.
Contrary to past practice, they should not be limited to ‘listen and repeat’ drills of isolated, decontextualized
sounds as in the case of minimal pairs: exclusively attending to the sheer mechanical, articulatory aspects of
pronunciation and achieving accuracy and dexterity in controlled practice does not necessarily lead to
transferring those same skills to actual conversation (Jones 1997). Aspects should also be considered that are
typical of connected speech, therefore sentences and dialogues should also be taken into account - especially
those that are more likely to occur in everyday communication. Moreover, varied practice material should be
chosen that meets different individual cognitive styles, in order to stimulate student motivation and participation,
two aspects that go hand in hand with good performance (Rubin 1987, O’Malley & Chamot 1990, Morley 1991,
Wharton 2000). The exercises, for instance, could follow a recurring pattern made up of different formats (eg.
‘listen and repeat’ exercises, ‘build a sentence’ exercises, dialogues and role-plays etc.)
Finally, in order to encourage even the most reticent students to engage in talk, special care should be taken
to create a stress-free environment, as communicative tasks in the L2 have been shown to generate the highest
levels of anxiety of all learning tasks (Young 1990). This need is particularly acute when training adults, who are
generally more inhibited than children and reluctant to produce speech in a foreign language for fear of losing
face – or even their linguistic identity (Guiora et al. 1972).
2.3. Feedback
The issue of feedback is still controversial: there appears to be a general disagreement on the definition of
corrective, implicit, explicit, or metalinguistic feedback, on whether different types of feedback should be
considered as a form of positive or negative evidence, and on what constitutes evidence for the effectiveness of
this factor, especially where pronunciation is concerned. On the whole, however, research on adult second
language acquisition indicates that corrective feedback from teachers, peers or native speakers makes adult
learners notice the discrepancies between their output and the L2 (Long 1996), an awareness which mere
exposure to the L2 does not guarantee. According to the ‘noticing hypothesis’ formulated by Schmidt (1990), it
is only this awareness that can lead to the acquisition of a specific linguistic item. The importance of feedback
appears even more obvious for learning L2 pronunciation, because many errors produced by L2 learners can be
attributed to unconscious interference phenomena from the L1 built-in phonological representations (Flege
1995). The L1 influence can be so overwhelming that the learner may not perceive the deviations in his/her
interlanguage from L2 standards. Feedback must then come into play, and more specifically, “a type of feedback
that does not rely on the student’s own perceptions” (Ehsani & Knodt 1998: 9 ). Through the provision of
4
feedback, teachers can bring the students to focus on specific individual problems and (indirectly) stimulate them
to attempt self-improvement. It is obvious that it is only once this awareness has been raised that the individual
can take remedial steps.
In spite of the crucial role of this factor, very little research has been carried out on the effectiveness of
different types of feedback. Recent studies seem to indicate that recast, i.e. a “rep etition with change” (and
possibly with emphasis) of the student’s incorrect utterance (Chaudron 1977: 39), is the most common type of
feedback adopted by teachers (see Nicholas, Lightbown & Spada 2001 for an overview). Recasts seem to be
effective because they are unobtrusive and thus do not interrupt the conversational flow, and because they are
immediate and thus allow for comparing and noticing of the new item to be learned (Nicholas et al. 2001). With
regard to pronunciation, it is worthwhile mentioning a study conducted by Lyster (1998) in French immersion
classrooms, which investigated teachers’ feedback strategies in student -teacher interactions and attendant learner
uptake – i.e. the immediate repair that students adopt on the basis of feedback. In this study, Lyster found that
recast had the highest rate of uptake for phonological errors, while it yielded the lowest rates of uptake for
grammatical and lexical errors. For these errors, the use of elicitation feedback, which always required the
students to attempt to generate the correct form themselves, produced the highest rates of correct studentgenerated
repairs. Contrary to research on complex grammatical tasks (Nagata 1993, Lyster & Ranta 1997,
Crompton & Rodrigues 2001), which indicates that feedback should generate self-repairs by stimulating higherorder
cognitive processes in the learner, Lyster’s (1998) study suggests that a simple reformulation of the
mispronounced utterance immediately following the student’s turn might be sufficient to su ccessfully correct it.
Other studies have also indicated the effectiveness of recasts at least for errors for which the learner has already
acquired the specific linguistic alternative offered and thus only needs to activate lower-order functions
(Nicholas, Lightbown & Spada 2001, Nagata 1993). Recasts have also been indicated as a good form of
feedback for beginner learners, because these learners are not proficient enough to discover the correct version
themselves if an error is merely signalled (Lightbown 2001). However, in this case recasts only seem to lead to a
temporary repair, rather than to the long-term retention of the correct item (Nicholas, Lightbown & Spada 2001).
It should be noted that most studies, including Lyster’s 1998, have only investi gated the short-term effects
of corrective feedback, because of the difficulty in isolating this factor in a real learning environment. Moreover,
contradictory results have been obtained because of inconsistent operationalizations of different types of
feedback among different studies. What seems uncontroversial is that feedback should not be limited to
classifying a response as correct or wrong, but should pinpoint specific errors and possibly suggest a remedy
(Chun 1998, Warschauer & Healey 1998, Crompton & Rodrigues 2001). In other words, besides receiving a
score, the student should comprehend why s/he got that score. It goes without saying that teachers do not need to
provide feedback on each of the student’s mistakes: such a course of action might be d iscouraging for the student
and extremely lengthy for the teachers. The pronunciation errors to be addressed could be selected on the basis
of different criteria, such as the ultimate aim of the training - be it accent-free pronunciation or intelligible
pronunciation - the specific L1-L2 combination, the degree of hindrance to comprehensibility and the degree of
persistence of the various errors, the student’s level of proficiency in the L2 etc.
A number of studies have addressed the issue of pronunciation error gravity hierarchies in an attempt to
establish which errors should be given priority in a pronunciation training programme (Van Heuven et al. 1981,
Anderson-Hsieh et al. 1992, Derwing & Munro 1997). The problem with many of these studies is that they suffer
from methodological limitations, because no distinction was drawn between accentedness and intelligibility (see
Derwing & Munro 1997). As a result, the findings of the various studies may sometimes seem contradictory.
Although clear indications are still lacking, it appears that both segmental and supra-segmental factors are
important (see Derwing, Munro & Wiebe 1998 for an overview). Segmental errors can preclude full
intelligibility of speech (Derwing & Munro 1997, Rogers & Dalby 1996). On the other hand, lexical stress and
intonation are important too, as they help listeners to process the segmental content by adding structure to the
complex and continuously varying speech signals (Celce-Murcia, Brinton & Goodwin 1996). Furthermore, both
levels are so tightly interwoven that, while they can be separated and measured instrumentally, in reality they
influence each other, as the case of stress placement well illustrates.
2.4. Conclusions
On the basis of this brief synopsis, we can outline some basic recommendations for the ideal design of effective
pronunciation teaching and learning. Learning must take place in a stress-free environment in which students can
be exposed to considerable and meaningful input, are stimulated to actively practise oral skills and can receive
immediate feedback on individual errors. Input should pertain to real-world language situations, it should include
multiple-speaker models and it should allow the learner to get a sense of the articulatory movements involved in
the production of L2 speech. Oral production should be elicited with realistic material and exercises catering for
different learning styles, and should include pronunciation of full sentences. Pertinent and comprehensible
feedback should be provided individually and with minimum delay and should focus on those segmental and
supra-segmental aspects that affect intelligibility most.
5
3. CAPT systems
If we assume that traditional class-based pronunciation teaching should be shaped according to the
recommendations that we just outlined, CAPT systems should be able to follow the same recommendations.
Moreover, if implemented adequately, CAPT can even offer a number of advantages compared to classroom
instruction. First, by allowing the student to freely roam through the system, these programs make it possible to
address individual problems. In addition, on the basis of priorities set by the user, CAPT is nowadays able to
present the student only with certain tasks aimed at developing specific skills. Second, CAPT systems allow the
students to train as long as they want and at self-paced speed. Third, as some studies suggest (Murray 1999), the
privacy and the self-directed kind of learning offered by these environments may lead to a reduction of foreign
language anxiety - a phenomenon strongly linked to social-judgement factors (Young 1990) - and thus indirectly
favour learning. Furthermore, student profiles can be stored by the system in a log-file so that the students
themselves can monitor problems and improvements, which in turn might result in increased motivation.
Alternatively, the teacher can refer to the logs and suggest appropriate remedial steps. Finally, the student might
in certain cases receive feedback on oral performance from the program itself, in real-time. On account of these
advantages, there have been various attempts to develop CAPT systems. However, the ideal requirements that
we sketched in the previous section are not often met by existing CAPT systems.
3.1. Input and output in CAPT
Needless to say, all the systems that are currently available provide abundant oral input. Some systems –
presumably in an attempt to save disk space and compact the package in a single CD-Rom - make use of stills to
accompany the information provided orally, sometimes adding text in balloons (ILT 1997, Auralog 2000).
Several systems also provide information on the way the target speech sounds are to be produced by explaining
how the articulators should be positioned. This is often done by means of a 3D representation of a mouth
producing a sound, sometimes accompanied by a written explanation (Auralog 2000, Glearner 2001, Pronunciation
2002), or by a video of a native speaker pronouncing the targeted sound (see for instance Glearner
2001, Nieuwe Buren 2002, the Advanced series, Eurotalk 2002). Animations and videos are obviously to be
preferred: while the mouth animations provide precise and realistic visual cues of single phones, the film
fragments also include information on facial expressions and gestures that accompany L2 speech acts and thus
provide information on pragmatic function too. Moreover, research indicates that the use of digital multimedia
materials can foster language learning because it looks authentic and appealing, it promotes proactive
involvement and engages various learning processes (Wachowicz & Scott 1999, Liontas 2002). Despite the
pedagogical usefulness of such functionalities, we have seen that speaking is crucial for improving
pronunciation. Therefore, a system that only provides input and merely trains receptive abilities will appear
remarkably limited from a second language learning perspective.
For this reason, most current CAPT systems are designed so as to stimulate the user to produce speech that
can subsequently be recorded and played back. The student can thus study his/her own output and attempt to
improve it by comparing it with a model, pre-recorded utterance. Examples of these systems are described in
Tutsui (1999) and van de Voort (1995); the reader is also invited to consult the CALICO Reviews (2002), the
LLT Software Reviews (2002), and the CALL Product Reviews (2002), which critically describe many systems
featuring these functionalities. The main problem with such systems is that it is up to the students to determine
whether and how their utterances differ from the native ones, while they may lack the criteria and the awareness
required to perform such an evaluation. As we have already pointed out, numerous studies have revealed that L2
learners often fail to perceive phonetic differences between their L1 and the L2 and that therefore external
feedback is needed (see 2.3.). On the other hand, those systems that require a teacher to listen to the recordings
and to evaluate them suffer from unfavourable teacher-student ratios, just like language classes in schools and
universities (eg. Nieuwe Buren 2002). Moreover, the functionalities offered by these systems are not innovative
compared to those employed in the traditional language lab. Finally, there are systems for distance learning that
resort to external feedback. These systems require the students to first practise and record themselves and then
either up-load the audio-files to a web page or send the files via e-mail in near real-time. Licensed trainers listen
to the files, evaluate and score them, and finally send them back to each student (Ferrier & Reid 2000, Ross
2001). This time, the problem is due to the fact that the student has to rely on a third party, and the feedback
arrives with a substantial delay.
On account of these shortcomings, we will now consider those systems that provide input, opportunity for
student’s output and automatic feedback that the student can retrieve and study when and as long as s/he wishes.
More specifically, we will provide an appraisal of the most representative CAPT systems featuring these options
by concentrating on the issue of feedback, which represents the biggest challenge for systems that claim to aim at
6
providing a one-to-one tutor-student interaction, and by looking at how the pedagogical indications we have
outlined are practically implemented with presently available technology.
3.2. Feedback in CAPT
The exact notion of external, corrective feedback is far from clear: in second language acquisition, the term
generally refers to information provided by native speakers or teachers on a non-targeted utterance – often called
negative evidence - but a more detailed definition is lacking, as is a classification of different types of feedback
and of their respective effectiveness for learning. In CALL systems, the term is mainly used to refer to
information on errors or on performance on a task in general, as a form of assessment of success, thus including
scores as well. Sometimes the term is even used to refer to instructions, explanations or clues in help facilities
(see for instance Pujolà 2001). It is only too natural, then, that current computer-generated feedback on
pronunciation exploits different techniques and graphical displays, targets different aspects of pronunciation, and
is more or less informative and explicit. In the following section we will examine various approaches to
feedback, in an attempt to establish which forms are more effective for learning.
3.2.1. Visual displays
Some CAPT systems provide instantaneous feedback in the form of graphic displays such as spectrograms and
waveforms which are often accompanied – for comparison - by previously stored displays of a model utterance
pronounced by the teacher or by a native speaker. These systems, which are generally authoring tools, make use
of programs that perform acoustic analyses of amplitude, pitch, duration and spectrum of the students’ speech .
Some of these systems are WinPitchLTL (Germain-Rutherford & Martin 2000, WinPitch 2002) developed by
Pitch Instruments Inc., VisiPitch by Kay-Elemetrics (Molholt 1988, 2001, Kay 2002) and VICK (Nouza 1998).
Akahane-Yamada & McDermott (1998) and Lambacher (1999) also describe similar systems, which they used to
teach English consonants to Japanese native speakers. The signal representations used in these CAPT systems
were not originally conceived as a means to support pronunciation training. On the contrary, they were all
designed to support phoneticians and speech scientists in specialized scientific research. Nevertheless, research
on pronunciation has generally shown that these types of visual displays, if paired to auditory feedback, can
contribute to improving pronunciation, especially with respect to intonation (De Bot 1983, Anderson-Hsieh
1992, Akahane-Yamada & McDermott 1998). The effectiveness of these types of displays is nonetheless
questionable for a number of reasons.
First of all, while attesting the usefulness of visual displays, some researchers also hypothesize that the
improvements noticed after training with this type of systems might simply be the result of the fact that the
student has devoted extra time to practice (De Bot 1983). Second, these systems perform an analysis of the
incoming speech signal without first ‘recognizing’ the utterance. This implies that there is no guarantee that the
student’s utterance does indeed correspond to the intended one. Third, the fact that the system shows two
comparable displays, one corresponding to the incoming utterance and one corresponding to the model utterance,
wrongly suggests that the ultimate aim of pronunciation training is to produce an utterance whose spectrogram or
waveform closely corresponds to that of the model utterance. In fact, this is not necessary at all: two utterances
with the same content may both be very well pronounced and still have waveforms or spectrograms that are very
different from each other. Moreover, while capturing waveforms and computing spectrograms is relatively easy,
these kinds of displays are not easily interpretable for students. Actually, they are representations of raw data that
require the presence of a teacher to interpret them.
Another option when using systems of this kind might be to train the students to autonomously read the
spectrograms and the waveforms. However, even students who have received some specific training are likely to
have a hard time deciphering these displays and extracting, from these raw data, the information needed to
improve pronunciation: correcting articulatory behaviour on the basis of spectrograms and waveforms is
particularly difficult because there is no simple correspondence between the articulatory gesture and the acoustic
structure in the properties displayed. In other words, as many authors regret, this type of feedback is not in line
with the requirement that feedback should first of all be easy to comprehend (Ehsani & Knodt 1998, Eskenazi
1999, Menzel et al. 2000, Kommissarchik & Kommissarchik 2000). Finally, since spectrograms and waveforms
cannot tell the average learner much about his/her errors and the specific causes of those errors, students are
likely to make random attempts at correcting the presumed errors - which, instead of improving pronunciation,
may have the effect of reinforcing poor pronunciation and eventually result in fossilization (Eskenazi 1999).
Pro-nunciation (Brown 2001, Pro-nunciation 2002) is a prototypical system that aims at teaching
pronunciation of words by providing limericks, tongue twisters, 3D animated mouth representations of
phonemes, and the possibility to display waveforms of the student’s u tterance for comparison with the model
one. The criticism of these kinds of displays is all the more appropriate in the case of waveforms, since these are
7
even more variable and less informative than spectrograms. Other systems, like the Talk To me (TTM 2002) and
the more comprehensive Tell me More series (Auralog 2000), are not exclusively based on waveforms as a form
of feedback, in that a global score is also provided and words that are incorrectly pronounced within a sentence
are colour-coded. However, the graphical importance the waveforms have on the screen suggests that they are
presented because of their flashy look, to impress the users – i.e. the buyers.
A much-praised system, WinPitchLTL (Germain-Rutherford & Martin 2000, WinPitch 2002), has been
developed by two phoneticians working on speech technology and pedagogy, as an authoring tool for different
learning environments. This system is able to analyse recorded speech of a maximum duration of 12 minutes and
display the pitch curve, the intensity curve and the ‘speech signal’ (in the form of a waveform or of a
spectrogram). The main advantage of this system is that it features ‘word-processing’ facilities: the teacher can
easily segment the speech signal displayed, label it by adding text on the display, highlight with different colours
relevant segments in the melodic curve or significant cues on a spectrogram, thereby making important
information easily visible and retraceable for the student. These are operations that the system cannot perform
automatically as the technology that underlies it cannot segment a complex speech signal. WinPitchLTL also
contains a synthesis feature that allows the teacher to modify the prosodic parameters of a student’s utterance
and redesign its acoustic properties within a given range on the basis of the target model. In this way, the student
can hear the correct prosodic contours with his/her own voice, which has been shown to help the student to better
perceive important deviations (Nagano & Ozawa 1990). However, the effectiveness of this system totally relies
on the teacher: a teacher must be available who previously received sufficient training in phonetics and acoustics
and who is able to pass on that information to the students by editing the speech signal, while this, of course, is
not the common rule (Price 1998). While this system offers the stated advantage to help teachers clearly indicate
what a pronunciation problem was and how it can be improved, it is unlikely that a teacher will be able to edit a
large number of utterances in such a detailed way. In other words, feedback will be subject, once again, to time
constraints and unfavourable teacher-to-student ratios.
Sometimes graphic displays of pitch contours, without the addition of the oscillogram or spectrogram, are
used to give feedback on intonational patterns (see Chun 1998). Like other systems using displays, these
programs presuppose some degree of training in interpreting the displays. However, pitch contours are easier to
interpret than spectrograms or oscillograms. In addition, while it is doubtful whether attempting to match a
spectrogram or an oscillogram is a meaningful exercise, trying to approach a pitch contour does certainly make
sense. Kommissarchik and Kommissarchik (2000) have discussed the shortcomings of various forms of suprasegmental
feedback and have developed a system for teaching American English prosody to non-native speakers
of English, BetterAccentTutor, in which readily accessible feedback is provided. Visual feedback is provided on
all three components of prosody: intonation, stress and rhythm. The students listen to a native speaker’s
recording studying its intonation, stress and rhythm patterns, utter a phrase and receive immediate audio-visual
feedback from the system. Both the students’ and the natives’ patterns are displayed on the screen so that the
students can compare them and notice the most relevant features they should match. The system offers two
major, easy-to-interpret visualization modes: intonation - visualized as a pitch graph on vowels and semivowels -
and intensity/rhythm - visualized as steps (syllables) of various length (duration) and height (vowel’s energy).
This program, however, does not address segmental errors. The rationale behind the system is based on the
assumption that “the three factors that have the biggest impact on intelligibility of speech are intonation, stress
and rhythm” (Betteraccent 2002), but no hierarchy order for speech intelligibility has yet been established and
research has evidenced that segmental errors can be detrimental for comprehension too.
3.2.2. Automatic assessment
With the exception of BetterAccentTutor, the systems described above have in common that the computer
produces some kind of direct visual representation of the speech signals, and all interpretation or manipulation is
left to the student and/or the teacher. Let us now take a look at some programs that do not require constant
support of a teacher and that let the computer compare model and student’s utterances with a view to producing a
pronunciation quality score. In this case, the feedback usually consists of a numerical or symbolic score – e.g. an
icon such as a smiley, an oral comment such as ‘well-done’, or a graded -bar indicating the degree of ‘nativeness’
– which is automatically generated by the system. The usefulness of automatic scoring is evident as this
technology gives the learner immediate, comprehensible information on output quality. However, the great
challenge in developing systems of this kind is to define the appropriate automatic measures the computer has to
calculate, where appropriate means 1) strongly correlated with human ratings of pronunciation quality and 2)
suitable to be used as a basis for providing feedback. The importance of the relation to human ratings is obvious:
in the end the students will have to talk to people and not to machines, so the quality of the pronunciation has to
be determined on the basis of what people deem acceptable. The second point can best be illustrated by referring
to the case of temporal measures of speech quality. These measures appear to be strongly correlated with human
8
ratings of pronunciation quality and fluency, and are therefore suitable for pronunciation testing (Cucchiarini,
Strik & Boves 2000, Franco et al. 2000). However, they do not constitute an appropriate basis for providing
feedback on pronunciation: telling students to speak faster is unlikely to lead to an improvement in the quality of
their pronunciation. SRI’s FreshTalk exemplifies the sort of system in which measures of non-nativeness such as
temporal speech properties are used as a basis for providing feedback, and indeed, the feedback related to speech
rate did not prove to be effective for improving the students’ pronunciation skills (Precoda, Ha lverson & Franco
2000). Given the limited usefulness of scores, programs should not solely rely on this type of feedback. Rather,
they might use it to integrate more meaningful and detailed information on the student’s oral performance.
Other CAPT systems provide a similar, albeit more implicit and more realistic type of feedback. The Tell me
More and the Talk to Me series by Auralog (Auralog 2000, TTM 2002) allow the students to train
communicative skills through interactive dialogues with the computer. The student hears an oral question that is
simultaneously displayed on screen, and replies with an answer that s/he chooses from three written responses
that are phonetically different. Through ASR that has been specifically trained for non-native speech, the
computer recognizes the student’s utterance and accordingly moves on to the following conversational turn. If
the program does not understand the student, it will prompt him/her to repeat the response. As each choice leads
the dialogue along a different path, the program ensures a certain degree of realism. Additionally, the student can
choose to check his/her oral performance on a page displaying the score s/he received, the waveform and the
sentence s/he produced, with the mispronounced words coloured red. Another system that simulates a real-world,
game-like learning setting is the Microworld contained in the Military Language Trainer (MILT, the version
being used at the U.S. Military Academy - MITAS is its commercially available sibling) (MITAS 2002,
LaRocca, Morgan & Bellinger 2001, Holland, Kaplan & Sabol 1999). In this case, the student orally asks the
computer to perform a simple action in a room with several objects, such as ‘put the book on the table’. If the
computer understands the utterance, it will perform the command given by the student. A similar method is used
in CPI’s TraciTalk (see TraciTalk 2002, Wachowicz & Scott 1999) even though this system was conceived as a
more generic CALL environment, rather than a CAPT system: the student interacts with an animated agent
whose task is to help the student to solve a mystery using the target language. The type of feedback these
systems provide is undoubtedly very effective to reinforce correct pronunciation behaviour, as it realistically
resembles the type of interaction that would take place with a human interlocutor. Moreover, it exploits the
advantages that involvement in games has for learning (Wachowicz & Scott 1999) and it allows the student to
acquire knowledge through task-based learning, i.e. while achieving non-linguistic goals (Nunan 1989).
However, neither Microworld nor TraciTalk are able to offer any help if a student cannot make him/herself
intelligible because, for instance, s/he cannot correctly pronounce a certain sound.
A serious attempt at diagnosing segmental errors and providing feedback on them has been made in the
ISLE project (Interactive Spoken Language Education, Menzel et al. 2000, ISLE 4.5 2001). This system targets
German and Italian learners of English, and aims at providing feedback on pronunciation errors, focusing in
particular on the word level, for which it checks mispronunciations of specific sounds and word-stress errors.
Limiting the system to a (few) known language pair(s), allows for good recognition performance by the ASR:
not only is the system specifically trained to recognize non-native, deviant speech in the given L2, it is also
trained to recognize typical errors due to interference from (a) specific L1(s). However, this approach can only
be adopted for specific L1-L2 pairs for which sufficient knowledge of typical pronunciation errors is available,
as in the case of the languages addressed in the ISLE system. It follows that these systems are not able to handle
unexpected, idiosyncratic errors that may be frequently made by some learners and that may be detrimental to
intelligibility. Another limitation is that phonemically different sounds (such as /æ/ and / in English) overlap in
acoustic space if the models of the sounds must be trained independently of a specific speaker (which is
necessary if it must be possible for arbitrary learners to use the system). This makes it very difficult for the
system to reliably decide whether the English words land and end were pronounced with the correct vowel
(Hillenbrand et al. 1995).
The ISLE system provides feedback by highlighting the locus of the error in the word. In addition, example
words are shown and can be listened to which contain, highlighted, the correct sound to imitate and the one
corresponding to the mispronounced version. While this feedback design seems satisfactory, the system yields
poor performance results. The authors report that only 25% of the errors are detected by the system and that over
5% of correct phones are incorrectly classified as errors. As the authors comment, with such a performance
“students will more frequently be given erroneous discouraging feedback than they will be given helpful
diagnoses” (Menzel et al. 2000: 54). Thus, future CAPT systems that use ASR to detect pronunciation errors
should focus on errors that can be detected with a high degree of robustness. In addition, it should help if more,
carefully transcribed non-native speech in different L2s became available: this could be used to train an ASR
system for the specific task of detecting typical pronunciation errors. Nevertheless, even if the performance of an
9
ASR system is optimised, it will never be perfect, and, consequently erroneous feedback will occasionally be
provided.
Erroneous feedback is a common problem in CAPT systems using ASR technology (see for instance the
evaluation of TriplePlayPlus and Learn German Now! in the CALICO- and the LLT Software Reviews, 2002).
Patently wrong error detection can be so frustrating for the student that Wachowicz and Scott (1999) recommend
using implicit rather than explicit, judgmental feedback. For example, a system that only indicates the part of a
word or utterance that was mispronounced, without indicating exactly which erroneous sounds it recognized, is
likely to make fewer errors than the ISLE system, simply because it makes only half the number of decisions.
And, as some suggested with regard to recasts, telling the student that some areas in his/her utterance were
incorrect and offering him/her the possibility to listen to the correct version – without attempting to also play a
version of the confusable counterpart - might just be sufficient feedback. Still, it would be necessary to focus on
pronunciation problems that are robust to detect. It goes without saying that those are errors where the distance
between the wrong and correct pronunciations is relatively large. Even if these errors do not cause confusions
between words, they are so conspicuous for a listener that they are likely to affect intelligibility.
3.3. Conclusions
To summarize, this overview of available CAPT systems has identified a number of pros and cons of these
systems, which should be taken into consideration when developing new prototypes. On the whole, we have seen
that an ideal system should provide input, output and feedback, and should incorporate ASR technology.
With regard to input and output, we have observed that presently available technology is sufficiently
advanced to match the pedagogical requirements sketched in section 2. The technology can now even offer
possibilities that are not available in traditional classroom learning. The limitations of those systems that make
use of outdated or less effective multimedia are only attributable to economic constraints or choices made by the
developers, and not to problems inherent in the technology.
What still remains problematic is the issue of feedback: its implementation in CAPT systems needs to be
studied carefully. We have seen that it is only through the integration of ASR technology and pedagogical
guidelines that we can design programs providing real-time, pertinent and easy-to-interpret feedback both on
segmental and supra-segmental aspects. However, the limitations in current ASR technology imply that error
diagnosis will only be possible with a limited degree of detail. Even if pedagogically desirable, detailed
diagnosis is simply not feasible because the performance levels attained are too poor. Reliability is crucial in
language learning: nothing could be more confusing for a learner than a system reacting in different ways to
successive realizations of the same mistake. It therefore seems that, if we want to reach an ideal compromise
between technology and demand, we will have to settle for something that is less ambitious, but that can
guarantee correct feedback at least in the majority of the cases.
4. Recommendations for future CAPT systems
In this analysis of available CAPT systems, we have observed that various devices are sometimes used without
an underlying pedagogical criterion, simply to make a fancy-looking product. In other cases, displays that are
easy to produce are used, while they either have little pedagogical value, or are not transparent for the student
and thus require support from an expert. We therefore suggest that developers first focus on the learner’s needs
and accordingly select functionalities and technology that meet those needs. A promising way to do this is by
incorporating the indications that are available from research on second language acquisition and teaching. In
this way, it is possible to suggest ways to design CAPT systems that make use of advanced technologies to
achieve pedagogical effectiveness.
As a general rule, i.e. whenever the learner does not have special needs, it seems advisable to include
pronunciation training within a comprehensive programme based on a communicative approach, as such an
environment is more likely to lead to meaningful, contextualized learning. Similarly, pronunciation training
should aim at attainment of speech intelligibility, rather than ‘nativeness’ or accent -free pronunciation. As to the
basic technical requirements an ideal system should fulfill, it is mandatory to include multimedia while keeping
the navigation through the system intuitive and easy, in order to make the learning setting as ‘human’ and
realistic as possible, and to prevent ‘technophobia’ in the students (Murray & Barnes 1998).
The system should contain a considerable amount of input in the L2. Input should ideally be presented in
interactive audio-visual material produced by different native speakers, such as film fragments and radio
interviews. Detailed study of articulatory movements could be catered for by means of 3D computer animations
of the lips and oral cavity. Simulations of real-life situations that are particularly likely to be experienced by the
learners should serve as learning context, because the relevance and the authenticity of this type of input can
boost the learner’s motivation.
10
The system should also include tasks that stimulate the student to practise what s/he has learnt by interacting
with the system. Exercises should be realistic, varied and engaging and should not be limited to listen-and-repeat
drills with isolated sounds or words. Role-plays with the characters in the system or interactive dialogues with
the computer as those used in some of the systems we have presented are, for instance, a good method to let the
student practise. Obviously, this type of exercises is only possible when ASR technology specially tuned for
non-native speech recognition is used. Moreover, this implies that the speaker’s utterance has to be predictable
because state-of-the-art ASR is not able to recognize free output with a satisfactory degree of reliability. Having
to reckon with this constraint nevertheless means that the students will always be able to compare their output
with a model utterance. Another means to allow the students to self-monitor their problems and progress is to
automatically store each utterance in a log-file.
Ideal systems should always include an option to provide feedback by means of ASR technology, so that the
user can receive immediate, individualized information on his/her performance. Because of the limitations of this
technology, error diagnosis will only be possible with a limited degree of detail. Automatic feedback on the
students’ responses could for instance be given in real -time at two levels: a graded-bar could be used to score
overall comprehensibility, while presumably incorrect areas are highlighted. A description of mispronounced
phones or syllables could then be offered by means of visual and aural feedback, with an option for comparing
input with output. In assessing pronunciation performance, both segmental and supra-segmental aspects should
be considered, such as temporal and spectral quality of speech sounds, word-stress, and sentence-accent.
Furthermore, in order not to discourage the students, a maximum number of errors to be pinpointed should
be set for each utterance. The feedback system could also be programmed to select and address only certain
pronunciation errors. Instead of trying to match an entire waveform, for example, the student’s attention might
be directed only to a few specific deviations. A question that could arise at this point is according to which
criteria the errors should be selected. We would like to suggest at least four important criteria that could be used
to select errors with a view to increasing the efficiency and the effectiveness of CAPT systems: 1) error
frequency, 2) error persistence, 3) perceptual relevance and 4) robustness of error detection.
First, the importance of error frequency is obvious: addressing errors that are infrequent will have little
impact on pronunciation performance and will therefore not significantly contribute to improving
communication. Second, concentrating on persistent pronunciation errors is a question of efficiency. Why should
we put effort in errors that simply disappear through exposure to the L2? Third, focusing on errors that are
perceptually relevant is a direct consequence of the ultimate aim of pronunciation training as we see it:
improving learners’ intelligibility. It follows that priority should be given to those errors that slow down and
even hamper communication. Finally, as explained above, not all pronunciation errors can be detected
automatically with a sufficient degree of robustness. As mentioned above, since reliability is crucial in language
learning, only errors that can be reliably detected should be addressed.
To conclude, in this paper we have outlined some pedagogical requirements that CAPT should ideally meet,
and we have looked at how those requirements are technologically implemented in available CAPT systems. We
hope that the suggestions we have given for future work can contribute to ameliorating CAPT design. However,
further research is needed to establish the effectiveness of specific systems that employ the functionalities
suggested here.
Acknowledgements
The present research was supported by the Social Science Research Council (MaGW) of the Netherlands
Organization for Scientific Research (NWO).
References
Abercrombie, D. (1991[1956]) Teaching pronunciation, in A. Brown (ed.), Teaching English Pronunciation: A
Book of readings, Routledge: London, 87-95.
Akahane-Yamada, R., McDermott, E. (1998) Computer-based second language production training by using
spectrographic representation and HMM-based speech recognition scores, Proceedings of ICSLP, Sydney,
Australia.
Anderson-Hsieh, J. (1992) Using electronic visual feedback to teach suprasegmentals, System 20, 51-62.
Anderson-Hsieh, J., Johnson, R., Koehler, K. (1992) The relationship between native speaker judgements of
nonnative pronunciation and deviance in segmentals, prosody and syllable structure, Language Learning 42,
529-555.
Auralog (2000) Tell me More, User’s Manual, Montigny-le-Bretonneux, France.
Ausubel, D. (1968) Educational Psychology: A Cognitive View. New York: Holt, Rinehart & Winston.
11
Betteraccent (2002) Our methodology, http://www.betteraccent.com/baapproach.htm. Last consulted
25/02/2002.
Bongaerts, T. (1999) Ultimate attainment in L2 pronunciation: The case of very advanced late learners, in
Birdsong, D. (ed.) The Critical period Hypothesis and Second language Acquisition, Mahwah, NJ: Lawrence
Erlbaum.
Bongaerts, T. (2001) Age-related differences in the acquisition of L2 pronunciation: The Critical Period
Hypothesis revisited, paper presented at the EUROSLA pre-conference workshop on the Age Factor in L2
acquisition, Paderborn, Germany.
Brennan, E.M, Brennan J.S. (1981) Accent scaling and language attitudes: Reactions to Mexican American
English speech, Language and Speech 24, 207-221.
Brown, I. (2001) Pro-Nunciation: The English Communication Toolkit, CALICO Journal Software Reviews,
CALICO Journal 19, 205-217.
CALICO Reviews (2002) The CALICO Review, http://astro.temple.edu/~jburston/CALICO/index.htm. Last
consulted 26/02/2002.
CALL Product Reviews (2002) Computer Assisted Language Learning @ Chorus, http://wwwwriting.
berkeley.edu/chorus/call/index.html. Last consulted 26/02/2002.
Celce-Murcia, M., Brinton, D.M., Goodwin, J.M., (1996) Teaching Pronunciation, Cambridge: CUP.
Chapelle, C.A. (1997) CALL in the year 2000: Still in search of research paradigms? Language Learning and
Technology 1, 19-43. Available online at http://llt.msu.edu/vol1num1/chapelle/default.html. Last consulted
27/02/2002.
Chapelle, C.A. (2001) Innovative language learning: Achieving the vision, ReCALL 13, 3-14.
Chaudron, C. (1977) A descriptive model of discourse in the corrective treatment of learner’s errors, Language
Learning 27, 29-46.
Chun, D.M. (1998) Signal analysis software for teaching discourse intonation, Language Learning and
Technology 2, 61-77. Available online at http://llt.msu.edu/vol2num1/article4/index.html. Last consulted
27/02/2002.
Cole, R., Carmell, T., Connors, P., Macon, M., Wouters, J., De Villiers, J., Tarachow, A., Massaro, D., Cohen,
M., Beskow, J., Yang, J., Meier, U., Waibel, A., Stone, P., Fortier, G., Davies, A., Soland, C. (1998)
Intelligent animated agents for interactive language learning, Proceedings of InSTILL, Marholmen, Sweden,
163-166.
Crompton, P. and Rodrigues, S. (2001) The role and nature of feedback on students learning grammar: A small
scale study on the use of feedback in CALL in language learning, Proceedings of the workshop on Computer
Assisted Language Learning, Artificial Intelligence in Education Conference, San Antonio, Texas, 70-82.
Crookes, G., Schmidt, R.W. (1991) Motivation: Reopening the research agenda, Language Learning 41, 469-
512.
Cucchiarini, C., Strik, H., Boves, L. (2000) Different aspects of pronunciation quality ratings and their relation to
scores produced by speech recognition algorithms, Speech Communication 30, 109-119.
De Bot, K. (1983) Visual feedback of intonation I: Effectiveness and induced practice behavior, Language and
Speech 26, 331-350.
De Bot, K. (1996) The psycholinguistics of the Output Hypothesis, Language Learning 46, 529-555.
Derwing, T.M., Munro, M.J. (1997) Accent, intelligibility, and comprehensibility, Studies in Second Language
Acquisition 20, 1-16.
Derwing, T.M., Munro, M.J., Wiebe, G. (1998) Evidence in favour of a broad framework for pronunciation
instruction, Language Learning 48, 393-410.
Dörnyei, Z. (1998) Motivation in second and foreign language learning, Language Teaching 31, 117-135.
Ehsani, F., Knodt, E. (1998) Speech technology in computer-aided learning: Strengths and limitations of a new
CALL paradigm, Language Learning and Technology 2, 45-60. Available online at
http://llt.msu.edu/vol2num1/article3/index.html. Last consulted 27/02/2002.
Eskenazi, M. (1999) Using automatic speech processing for foreign language pronunciation tutoring: Some
issues and a prototype, Language Learning and Technology 2, 62-76. Available online at
http://llt.msu.edu/vol2num2/article3/index.html. Last consulted 27/02/2002.
Eurotalk (2002) http://www.eurotalk.co.uk/ETWebPages/Products/DVDF.html. Last consulted 27/02/2002.
Ferrier, L. Reid, L. (2000) Accent modification training in The Internet Way®, Proceedings of InSTILL,Dundee,
Scotland, 69-72.
Flege, J.E. (1995) Second-language speech learning: Findings and problems, in Strange, W. (ed.), Speech
Perception and Linguistic Experience: Theoretical and Methodological Issues, Timonium, MD: York Press,
233-273.
12
Flege, J. E. (1999) Age of learning and second language speech, in Birdsong, D. (ed.) Second Language
Acquisition and the Critical Period Hypothesis, Mahwah, NJ: Lawrence Erlbaum Associates, 101-131.
Franco, H., Neumeyer, L., Digalakis, V., Ronen, O. (2000) Combination of machine scores for automatic
grading of pronunciation quality, Speech Communication 30, 121-130.
Germain-Rutherford, A., Martin, P. (2000) Présentation d’un logiciel de visualisation pour l’apprentissage de
l’oral en langue seconde, ALSIC 3, 61-76. Available online at http://alsic.u-strasbg.fr/Menus/frameder.htm.
Last consulted 25/02/2002.
Glearner (2001) www.glearner.com. Last consulted 10/05/2001.
Guiora, A.Z., Brannon, R.C., Dull, C.Y. (1972) Empathy and second language learning, Language Learning 22,
111-130.
Hillenbrand, J., Getty, L.A., Clark, M.J., Wheeler, K. (1995) Acoustic characteristics of American English
vowels, Journal of the Acoustical Society of America 97, 3099-3111.
Holland, V.M., Kaplan, J.D., Sabol, M.A. (1999) Preliminary tests of language learning in a speech-interactive
graphics microworld, CALICO Journal 16, 339-359.
ILT (1997) Interactive Language Tour, München: Digital Publishing.
ISLE 1.4 (1999) Pronunciation training: Requirements and solutions, ISLE Deliverable 1.4. Available online at
http://nats-www.informatik.uni-hamburg.de/~isle/public/D14/D14.html. Last consulted 27/02/2002.
ISLE 4.5 (2001) Error diagnosis for spoken language, ISLE Deliverable 4.5. Available online at http://natswww.
informatik.uni-hamburg.de/~isle/public/D45/D45.html. Last consulted 27/02/2002.
Jones, R.H. (1997) Beyond ‘Listen and Repeat’: Pronunciation teaching materials and theories of Second
Language Acquisition, System 25, 103-112.
Kay (2002) Kay, http://www.kayelemetrics.com. Last consulted 26/02/2002.
Keller, J. M. (1983) Motivational design of instruction, in C. M. Reigelruth (ed.), Instructional Design Theories
and Models: An Overview of their Current Status, Hillsdale, NJ: Lawrence Erlbaum, 383-434.
Kendrik, H. (1997) Keep them talking! A project for improving students’ L2 pronunciation, System 25, 545-560.
Kommissarchik, J., Komissarchik, E. (2000) Better Accent Tutor – Analysis and visualization of speech
prosody, Proceedings of InSTILL 2000,Dundee, Scotland, 86-89.
Krashen, S.D. (1981) Second language Acquisition and Second Language Learning, Oxford: Pergamon Press.
Krashen, S.D., Terrell, T.D (1983) The Natural Approach: Language Acquisition in the Classroom, Oxford:
Pergamon Press.
Lambacher, S. (1999) A CALL tool for improving second language acquisition of English consonants by
Japanese learners, Computer Assisted Language Learning 12, 137-156.
LaRocca, S., Morgan, J., Bellinger, S. (2001) Optimizing speech recognition for use by learners of less
commonly taught languages, Show and Tell presentation, EuroCALL, Nijmegen, The Netherlands.
Levy, M. (1997) Computer-assisted Language Learning: Context and Conceptualization, Oxford: Clarendon
Press.
Leather, J., James, A. (1996) Second language speech, in Ritchie, W.C., Bhatia, T.K. (eds.) Handbook of Second
Language Acquisition, San Diego, CA: Academic Press, 269-316.
Lightbown, P.M. (2001) Input filters in second language acquisition, EUROSLA Yearbook 1, Amsterdam: John
Benjamins, 79-97.
Liontas, J. (2002) CALLMedia Digital Technology: Whither in the new millennium, CALICO Journal 19, 315-
330.
LLT Software Reviews (2002) LLT Archives - Software Reviews, http://llt.msu.edu/archives/software.html. Last
consulted 27/02/2002.
Logan, J.S., Lively, S.E., Pisoni, D.B. (1991) Training Japanese listeners to identify English /r/ and /l/ III: Longterm
retention of new phonetic categories, Journal of the Acoustical Society of America 89, 874-886.
Long, M.H. (1996) The role of the linguistic environment in second language acquisition, in Ritchie, W.C.,
Bhatia, T.K. (eds.) Handbook of Second Language Acquisition, San Diego, CA: Academic Press, 413-468.
Lyster, R. (1998) Negotiation of Form, Recasts, and Explicit Correction in relation to error types and learner
repair in immersion classrooms, Language Learning 48, 183-218.
Lyster, R. Ranta, L. (1997) Corrective feedback and learner uptake, Studies in Second Language Acquisition 19,
37-66.
Massaro, D.W. (1987) Speech Perception by Ear and Eye: A Paradigm for Psychological Enquiry, Hillsdale,
NJ: Lawrence Erlbaum.
Menzel, W., Herron, D., Bonaventura, P., Morton, R. (2000) Automatic detection and correction of non-native
English pronunciations, Proceedings of InSTILL, Dundee, Scotland, 49-56.
MITAS (2002) Multimedia Instructional Tutoring and Authoring System with 3D,
http://www.maad.com/MaadWeb/products/mitas/mitasma.htm. Last consulted 25/06/2002.
13
Molholt, G. (1988) Computer-assisted instruction in pronunciation for Chinese speakers of American English,
TESOL Quarterly 22, 91-111.
Molholt, G. (2001) Three modes of visualization, Paper presented at InSTILL, EuroCALL, Nijmegen, The
Netherlands.
Morley, J. (1991) The pronunciation component in teaching English to speakers of other languages, TESOL
Quarterly 25, 481-519.
Munro, M.J., Derwing, T.M. (1995) Foreign accent, comprehensibility and intelligibility in the speech of second
language learners, Language Learning 45, 73-97.
Murray, G.L. (1999) Autonomy in language learning in a simulated environment, System 27, 295-308.
Murray, L., Barnes, A. (1998) Beyond the ‘wow’ factor – evaluating multimedia language learning software
from a pedagogical point of view, System 26, 249-259.
Nagano, K., Ozawa, K. (1990) English speech training using voice conversion, Proceedings of ICSLP, Kobe,
1169-1172.
Nagata, N. (1993) Intelligent computer feedback for second language instruction, The Modern Language Journal
77, 330-339.
Nicholas, H., Lightbown, P.M., Spada, N. (2001) Recasts as feedback to language learners, Language Learning
51, 719-758.
Nieuwe Buren (2002) Nieuwe Buren, http://www.nieuweburen.nl. Last consulted 26/02/2002.
Nouza, J. (1998) Training speech through visual feedback patterns, Proceedings of ICSLP, Sydney, Australia.
Nunan, D. (1989) Designing Tasks for the Communicative Classroom, Cambridge, UK: CUP.
O’Malley, J.M., Chamot, A.U. (1990) Learning Strategies in Second Language Acquisition, Cambridge: CUP.
Oxford, R.L., Anderson, N.J. (1995) A crosscultural view of learning styles, Language Teaching 28, 201-215.
Pennington, M.C. (1999) Computer-aided pronunciation pedagogy: Promise, limitations, directions, Computer
Assisted Language Learning 12, 427-440.
Piske, T., MacKay, I.R.A., Flege, J.A. (2001) Factors affecting degree of foreign accent in an L2: A review,
Journal of Phonetics 29, 191-215.
Precoda, K., Halverson, C.A., Franco, H. (2000) Effects of speech recognition-based pronunciation feedback on
second-language pronunciation ability, Proceedings of InSTILL, Dundee, Scotland, 102-105.
Price, P. (1998) How can speech technology replicate and complement skills of good language teachers in ways
that help people to learn language? Proceedings of InSTILL, Marholmen, Sweden, 81-86.
Pro-nunciation (2002) Products, http://users.zipworld.com.au/~pronunce/products.htm. Last consulted
26/02/2002.
Pujolà, J.-T. (2001) Did CALL feedback feed back? Researching learners’ use of feedback, ReCALL 13, 79-98.
Rogers, C., Dalby, J. (1996) Prediction of foreign-accented speech intelligibility from segmental contrast
measures, Journal of the Acoustical Society of America 100, Pt. 2, 2725 (A).
Ross, K. (2001) Teaching languages with asynchronous voice over the Internet, Paper presented at InSTILL,
EuroCALL, Nijmegen, The Netherlands.
Rubin, J. (1987) Learning strategies: Theoretical assumptions, research history and typology, in Wenden, A.L.,
Rubin, J. (eds.) Learner Strategies in Language Learning, Englewood Cliffs, NJ: Prentice Hall, 15-30.
Schachter, J. (1998) Recent research in language learning studies: Promises and problems, Language Learning
48, 557-583.
Schmidt, R.W. (1990) The role of consciousness in second language learning, Applied Linguistics 11, 129-158.
Scovel, T. (1988) A Time to Speak. A Psycholinguistic Inquiry into the Critical Period for Human Speech,
Rowley, Mass.: Newbury House.
Swain, M. (1985) Communicative competence: some roles of comprehensible input and comprehensible output
in its development, in Gass, M.A., Madden, C.G. (eds.) Input in Second Language Acquisition, Rowley MA:
Newbury House, 235-253.
Swain, M., Lapkin, S. (1995) Problems in output and the cognitive process they generate: A step towards second
language learning, Applied Linguistics 16, 371-391.
TraciTalk (2002) Traci Talk, The Mystery, http://www.encomium.com/CPI/CPITTTM.html. Last consulted
24/06/2002.
TTM (2002) Talk to Me, the Conversation Method, http://www.auralog.com/en/talktome.html. Last consulted
26/02/2002.
Tutsui, M., Masashi, K., Mohr, B. (1999) Closing the gap between practice environments and reality: An
interactive multimedia program for oral communication training in Japanese, Computer Assisted Language
Learning 11, 1999, 125-151.
Van de Voort, M. (1999) Gluren naar de buren, Nijmegen: UTN.
14
Van Heuven, V.J.J.P., Kruyt, J.G., de Vries, J.W. (1981) Buitenlandsheid en begrijpelijkheid in het Nederlands
van buitenlandse arbeiders; een verkennende studie, Forum der Letteren 22, 171-178.
Wachowicz, K., Scott, B. (1999) Software that listens: It’s not a question of whether, it’s a qu estion of how,
CALICO Journal 16, 253-276.
Warschauer, M., Healey, D. (1998) Computers and language learning: An overview, Language Teaching 31, 57-
71.
Watts, N. (1997) A learner-based design model for interactive multimedia language learning packages, System
25, 1-8.
Wharton, G. (2000) Language learning strategy use of bilingual foreign language learners in Singapore,
Language Learning 50, 203-243.
WinPitch (2002) Pitch Instruments Inc., http://www.winpitch.com. Last consulted 26/02/2002.
Young, D.J. (1990) An investigation of students’ perspectives on anxiety and speaking, Foreign Language
Annals 23, 539-553.
Tidak ada komentar:
Posting Komentar