Adam Bramlett

Mandarin Tone Acquisition as a Multimodal Learning Problem: Tone 3 Diacritic Manipulation

Tone acquisition continues to receive increasing scholarly attention in both first language (L1) and second language (L2) acquisition research (e.g., Liu et al., 2011; Wiener, 2018; Zhang, 2018). Research on tone acquisition by learners without lexical tone experience sits at an interesting intersection between theories of learning and non-native speech perception. Learning lexical tone is a perceptual learning task, which requires an investigation of more basic processes (perception, attention, and salience) (Liu et al., 2011; Cintrón-Valentín & Ellis, 2016). Mandarin tone 3 (T3) provides a potentially fruitful opportunity for investigating the basic processes of non-native speech perception and learning. T3 is notorious both for its variation in form and for its difficulty in L2 acquisition (e.g., Zhang, 2018; Duanmu, 2007). The most common representations are referred to as low T3 (21) and full, or falling-rising, T3 (214); tone number systems are explained in The Role of Multimodal Representations in Tone Learning, below. There has been a long-standing debate in the phonology literature over the basic form of T3. However, the conversation has only just begun in non-native speech perception.

The Role of Multimodal Representations in Tone Learning

Lexical tone is a word-level feature. However, its orthographic representation is complicated by disagreement over how it should be made salient in written form. Like the tones of many other tone languages, Mandarin tone is described primarily as a series of F0 patterns. The F0 patterns of Mandarin tones have been conceptualized with various representation systems. Chao (1930) was the first to propose a tonal representation system. His system, the five-point scale (FPS), uses successive numbers to describe tone on each syllable, “1” being the lowest pitch and “5” being the highest pitch. A change in number in FPS implies contour. Chao (1930) also introduced Gwoyeu Romatzyh (GR), which aims to emphasize the internal importance of tone (the word-level relationship between tone and syllable) by using spelling alternations for different tones. As in the current study, the GR spelling alternations were created with the intention of making the tonal system of Mandarin accessible to speakers of non-tonal languages (Chao, 1948). Hanyu Pinyin (Pinyin) was first developed from Chao’s FPS system by a group of linguists in the 1950s and modified to its current standard in 1995 (Zhou, 1995). Pinyin is the predominant romanization system used today. See table 1 for examples of the FPS system, Pinyin, and GR.
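The idea that a change in number implies contour can be sketched in a few lines of Python. This is only an illustration: the function name `fps_contour` and the label strings are assumptions introduced here, not part of Chao's system or of this study's materials.

```python
def fps_contour(fps: str) -> str:
    """Label the contour implied by a five-point-scale (FPS) string such as "214"."""
    levels = [int(ch) for ch in fps]
    if any(not 1 <= lv <= 5 for lv in levels):
        raise ValueError("FPS levels run from 1 (lowest) to 5 (highest)")
    # Direction of each successive step: +1 rising, -1 falling, 0 level.
    steps = [(b > a) - (b < a) for a, b in zip(levels, levels[1:])]
    moves = [s for s in steps if s != 0]
    if not moves:
        return "level"
    # Collapse consecutive steps that move in the same direction.
    shape = [moves[0]]
    for m in moves[1:]:
        if m != shape[-1]:
            shape.append(m)
    names = {1: "rising", -1: "falling"}
    return "-".join(names[m] for m in shape)


for tone, fps in [("T1", "55"), ("T2", "35"), ("low T3", "21"), ("full T3", "214"), ("T4", "51")]:
    print(tone, fps, fps_contour(fps))
```

Note that low T3 (21) and T4 (51) receive the same contour label here and differ only in pitch height, which is why height and contour are treated as separate features later in the text.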


Table 1

Name    Tone 1  Tone 2  Tone 3   Tone 4
FPS     ma55    ma35    ma21(4)  ma51
Pinyin  mā      má      mǎ       mà
GR      mha     ma      maa      mah

These systems of tonal representation, in essence, use visual means to make tone a salient feature of the romanization system. That is, they use multimodal means to represent phonological or phonetic categories to assist the learning of tones. These systems were developed for two primary reasons: 1) to descriptively represent the categories of tone, e.g., representing T3 as mǎ, and 2) to aid the learning of tones by L2 learners (in the case of GR). However, the descriptive representation of a tone and the best solution for L2 learners are not necessarily the same. In other words, it is possible that the underlying representation of tone for native speakers is most accurately represented by one visual form, while the most effective visual representation for tone learners is a different visual form. More recent work on tone is gaining the necessary tools to understand the effectiveness of tone representation systems and their sub-parts for L2 learning, as I outline below. Recent work has continued this tradition by testing new systems which utilize other visual means such as color, shape, and a combination of cueing mechanisms. McGinnis (1997) examined the relationship between romanization systems and learning efficacy. His study examined Mandarin learning through two romanizations of Mandarin, GR and Pinyin. Comparisons between Pinyin and GR showed that students learning with the same content material and teaching methods learn faster and retain tones longer, with better long-term accuracy, when using Pinyin. It is notable that Pinyin yields better tone-learning results for both non-tonal L1 and pitch-accent L1 learners (English and Japanese, respectively).
The consistent diacritics in Pinyin (markings above the syllable that represent tone categories) are helpful for recognizing tone as a category, while GR represents tone with various spelling alternations depending on the vowels of the respective word (see table 1). Liu et al. (2011) examined the subparts of the Pinyin system (Pinyin-spelling romanization and tone diacritics) in an identification task, attempting to understand the relationship between these subparts and Mandarin tone learning. The results favored the combination of diacritics and Pinyin spellings over Pinyin spellings with FPS numbers and over diacritics without Pinyin spellings. That is, participants trained with the regular Pinyin system showed better identification than participants trained with a system that uses only diacritics or a system that replaces the diacritics with numbers. It is suggested that this may reflect a modality principle which utilizes both visual and auditory information about the pitch contour and register (pitch height) (Mayer & Moreno, 2002). The visual input helps confirm the auditory input and allows a more robust category to be formed. Liu et al. (2011) argue that the correspondence between the pitch contours and the pitch height of the visual and auditory stimuli eased processing, whereas the numeric indicators were more burdensome. The lack of attention to the direction of pitch change by non-tonal L2 learners of Mandarin (Gandour, 1983) is remedied by making the visual representation of tones salient, which is supported by the intuitive nature of relating pitch height variation to height in physical space (Rusconi et al., 2006). Godfroid et al. (2017) continued the investigation of multimodal learning for Mandarin tone through an online training session which explored color, shape, symbol, and number.
The researchers used several different cueing mechanisms to examine the learning of tones through an identification task. While color variation did show improvement, symbol and number showed more robust results (better identification) across participants and cueing mechanisms.

While some form of Pinyin has been used to visually represent tone for Mandarin since the 1950s, Sumby and Pollack (1954) were the first to explore speech perception as a multimodal cognitive task. Likewise, Casasanto et al. (2003) were the first to provide evidence for a link between spatial movement and pitch movement in the general cognitive domain; their experiment showed that identical auditory stimuli were perceived as higher in pitch when paired with a higher visual spatial placement. Similarly, Mitchel and Weiss (2011) found that learning of non-linguistic tones decreased in treatments with less cross-modal (visual and auditory) coherence. For lexical tone specifically, several more recent experiments have tested facial and gestural congruency in tone learning and found improved learning in conditions that provide congruency across audio-visual input (e.g., Hannah et al., 2017; Baills et al., 2019). Stated simply, visuospatial information has been found to be consistently tied to pitch information across general perception studies and language-specific learning studies for tone.

In summary, the dual purpose of visual representation systems for tone creates a tension between a descriptive model of tone and an L2 learning aid. T3's basic representation for native speakers is still being debated. However, low T3 (21) is the more common form of T3 in connected speech (Duanmu, 2007). While the respective frequencies of low T3 and full T3 in and out of the classroom are currently unknown, the question remains open as to whether low T3 would be a better representation of T3 for L2 learners.
Results of multimodal research support highly congruent visual and auditory information in tone learning. Variation of Mandarin T3 provides an opportunity for exploring congruency between visual and auditory input. The next section explores the learning of tones using Pinyin to better understand Mandarin tone learning, with particular emphasis on the congruency between tones and their visual representations.

Congruence Between Pinyin and Mandarin Tones

As previously mentioned, Mandarin tone can be described with two principal perceptual features: pitch height and pitch direction of change (Gandour, 1983). Here, I describe the visual coherence between Pinyin tone representation and Mandarin tones. Mandarin T1 is a high flat tone represented as "mā", which is congruent in pitch levelness but incongruent in representing the height of T1. Mandarin T2 and T4, má and mà, are congruently represented in Pinyin for both pitch height and pitch contour. Due to its variation in form, T3 is more complicated. Mandarin T3 is represented as "mǎ", which accurately represents the falling and rising of full T3 but misrepresents the onset, or starting point, of the tone contour. For low T3, pitch height is implicit within the falling-rising contour of Pinyin (lowness is part of the full T3 diacritic), but the contour is incongruent. At the time of writing this study, major textbooks that use Pinyin use “v” (e.g., mǎ) as the diacritic for T3. When compared to “一”, “/”, and “\” (T1, T2, and T4), “v” (T3) visually implies the most contour. However, in several studies, the lowness of T3 has been found to be the only perceptually salient cue for T3 recognition in context, out of context, and in varying sentence positions (Duanmu, 2007; Xu, 1997; Xu, 2015). Furthermore, the falling-rising T3 form (which does occur in citation form and slowed speech) is less common than low T3 in connected speech (Shi & Li, 1997).
Furthermore, Suo (2019) found success in teaching low T3 over full T3. She found that the use of low T3 (21) produced better results in both listening and production performance than the use of 214. In her study, four Mandarin classrooms were separated into two conditions and taught with one of two T3 variations for 9.5 weeks. In the first condition, students were taught using full T3. In the second condition, students were taught using low T3. For the second condition, teachers did not change the Pinyin T3 to the low form. Teachers simply refrained from using the full variation of T3, even when producing T3 independently. Results showed a statistically significant difference between the two conditions in both students’ oral reading scores and their listening scores. The listening task was a forced-choice task, which required students to decide which tones they heard by selecting a disyllabic word. The oral reading task was a production task, which was rated by three trained native Mandarin speakers. A delayed post-test was conducted 18 weeks after the instruction, and effects remained significant between conditions. Results in this study consistently showed that teaching T3 as a low tone improved students' learning of Mandarin tones. Although auditory input in the Mandarin classroom may tend to exemplify the full form of T3 (frequent slowed speech and citation form speech), no studies have directly examined the use of full and low T3 in and out of the classroom. The “v” visual cue may be causing long-term acquisition problems because it makes the contour of T3, rather than its lowness in pitch, the most salient feature. The misalignment between auditory and visual input (Mandarin classroom aural input and explicit instruction) may lead to perceptual incongruence.
The difficulty of T3 across L2 Mandarin learners and the variation in forms of T3 call for an experiment considering the effect of congruence between pitch height and the visual representation of T3 itself. Although it has been claimed that low T3 is better for L2 learners (Suo, 2019; Zhang, 2018), no studies have explored current learners’ explicit knowledge of low T3 or the effects of using low T3 as a visual representation of T3.

Experiment: Tone Identification Task

As McGinnis (1997) indicates, Pinyin diacritics are one of the most effective representation systems for Mandarin tones in L2 learning because they represent F0 movement, which provides a visual link between pitch movement and tone category. Phonologically, there is debate over whether full T3 or low T3 is the actual representation of T3, while scholars generally agree upon tones 1, 2, and 4. If the full representation of T3 (e.g., mǎ) is the most accurate phonological representation of T3 for native speakers, it is still not necessarily true that L2 learners learn most effectively with this abstracted form in tone acquisition. If the most accurate underlying representation of T3 is a low tone, the “v” visual representation may guide L2 learners' attention to contour instead of the salient feature of T3 (lowness). Regardless, the most effective visual representation in L2 tone learning does not necessarily match native speakers' underlying representations. The accuracy of the visual representation of a tone is crucially important for the L2 learner, as it acts as a multimodal confirmation of the auditory stimuli. The learning studies mentioned above have found that the visual representation of tone can impact the efficiency and efficacy of learning the lexical tones of Mandarin (e.g., Liu et al., 2011; Hannah et al., 2016; Baills et al., 2018). Mandarin tone learning has been studied extensively. Suo (2019) shows that orally teaching T3 as a low tone yields better learning results across classrooms. However, no studies have considered changing the shape of tone diacritics for L2 tone learning to explore the effects of congruence between pitch height and visual height.
While earlier studies have claimed that teaching T3 as a low tone improves perceptual categorization of T3 (Suo, 2019) and claimed that T3 is primarily a low tone (Zhang, 2018), no study has developed an alternative solution for T3 diacritic representation, nor experimentally tested the effects of multimodal training to confirm the hypothesis of improved learning of T3 as a low tone. The following research questions can then be asked.

  1. To what extent does changing the third tone diacritic to a low tone affect identification of T3?

  2. To what extent does improved identification of T3, if there is any, improve categorization of T1, T2, and T4?

  3. To what extent does changing the third tone diacritic affect learning of Mandarin tones throughout training for overall identification and rate of learning?

These research questions lead to the following hypotheses. The first hypothesis, for RQ1, is that manipulating the T3 diacritic to a low tone will result in better identification of T3, with less confusion with T2 and T4. The second hypothesis, for RQ2, is that improved identification of T3 will improve the identification of other tones. Specifically, previous studies have claimed that T3 is confused for and with T2 and T4 (Kiriloff, 1969; Wang et al., 1999). Here I attempt to understand whether improved learning of T3 will improve the identification of T2 and T4. It is unclear from previous work whether the confusion of T3 is bidirectional. In the analysis, I examine the distribution of wrong answers for T2 and T4 to see whether the confusions with T3 are bidirectional or simply unidirectional (e.g., T3 is incorrectly selected for T2, but T2 is not incorrectly selected for T3). For both identification of T3 and other tones, improved identification will be examined through accuracy: the number of correct responses divided by the total number of trials. Lastly, for RQ3, it is predicted that participants who use the low T3 diacritic will have better overall identification of Mandarin tones throughout training. Overall, in this case, means collapsing over tone. Here, I ask whether there will be an overall advantage to learning Mandarin tones with one T3 diacritic versus another. That is, I am examining the effects of changing the T3 diacritic on the number correct over training for all tones, not just for any particular tone (as RQ1 and RQ2 do). In McGinnis (1997), participants learning Mandarin tones with Pinyin showed an improved starting point (identified more accurately from the start), with even better identification after a semester of using Pinyin over GR.
As in his study, I predict that changing T3 will improve the identification of T3 and other Mandarin tones from the very beginning, with better identification throughout the training. It is unclear whether rates of learning will change between groups in this short training session. However, because of the perceptual ease of processing the low T3, it is predicted that participants using low T3 will also learn at faster rates on average. In short, the prediction is that participants using low T3 will on average show better tone identification overall throughout training and learn at faster rates.

Participants

Twenty-five consenting adults, each with access to a computer running Google Chrome and a pair of headphones, participated in the experiment for either course credit through the University of Hawaii at Manoa or an Amazon gift card valued at 5 USD. None of the participants had experience learning a tonal language. The experiment was done remotely by each participant in their own time. Headphone use was required and tested before the experiment began. Instructions were sent to the participants after they contacted the researcher with interest in participating. Instructions were written in a simple and clear way. No participants were confused by the procedure throughout testing. All participants reported normal or corrected hearing and eyesight. The mean age of participants was 32 years. Language background information was collected through a questionnaire to ensure that participants did not speak a tonal language without being aware that it is tonal. Language experience was self-reported: participants were instructed to list all known and studied languages in order of dominance, from strongest to weakest. One participant was excluded from data analysis for experience with a tonal language. The remaining participants were randomly assigned to one of two conditions, namely system 1 and system 2, which are described below in Materials and Procedure.
Questionnaire data revealed no statistically meaningful differences between participants in the two conditions with respect to language background, age, musical training, or vocal training (see the appendix for detailed information).

Materials

The materials consisted of 32 target real Mandarin words recorded using the Shure MV88 application and recorder in soundproof booths by four phonetically trained Mandarin native speakers who are all Mandarin teachers in a university setting. This study used high variability phonetic training (HVPT) for its training identification task, which has been found to improve categorization of tones in prior tone learning research (e.g., Wiener, 2020). Four speakers were used to create more variability in productions of tones and thereby increase category learning of tones. When speaker input is highly variable, learners must generalize over speaker-specific and word-specific acoustic similarities to correctly categorize tone with specific phonetic patterns (see Thomson, 2012, for a full review of HVPT and Wiener, 2020, for use of HVPT in a production study of Mandarin tones). All sounds were recorded in the carrier sentence “我说 x 这个词 (I say the word X)”. The complete sentences were presented in a random order so that speakers would not get used to a specific ordering of tones or syllables and would pronounce the words more naturally. Speakers were asked to say each sentence naturally, as if speaking with a friend. The 32 real words are made up of four contrasting initials and two contrasting finals, each recorded for all four tones (4 initials x 2 finals x 4 tones = 32 unique sounds by each speaker). The carrier sentence was designed to avoid tone sandhi issues. While research about tone sandhi is meaningful and interesting, it is not the focus of this study. The eight syllables used for Mandarin consisted of ba, bi, la, li, ma, mi, da, and di.
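The stimulus counts above can be checked with a short Python sketch that enumerates the design. The speaker labels `S1` to `S4` are placeholders assumed here for illustration; the source does not name the speakers.

```python
from itertools import product

INITIALS = ["b", "l", "m", "d"]      # four contrasting initials
FINALS = ["a", "i"]                  # two contrasting finals
TONES = [1, 2, 3, 4]                 # the four Mandarin tones
SPEAKERS = ["S1", "S2", "S3", "S4"]  # hypothetical speaker labels

# Every syllable (initial + final) crossed with every tone gives the target words.
words = [(ini + fin, tone) for ini, fin, tone in product(INITIALS, FINALS, TONES)]

# Every word recorded by every speaker gives the full set of audio tokens.
tokens = [(spk, syl, tone) for spk, (syl, tone) in product(SPEAKERS, words)]

print(len(words))   # 4 initials x 2 finals x 4 tones
print(len(tokens))  # words x 4 speakers
```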
Words were then manually extracted from the complete sentences using Praat (Boersma & Weenink, 2021) by two Mandarin speakers with training in phonetic transcription. The experiment was designed using Python and carried out on PsychoPy’s Pavlovia online platform (Peirce et al., 2019; van Rossum, 1995).

Procedure

The experiment consisted of an identification training task. Each of the 128 individual tokens constituted a single trial. The 128 tokens were split into four 32-trial blocks. Blocks were organized by syllable, tone, and speaker. That is, each block had eight words from each speaker, with two of each tone but only one recording of each possible syllable. Each of the four blocks had an equal number of each tone and syllable. Each block was made up of 32 unique recordings that were presented only once in the entire experiment. Each block contained eight of each tone, for a total of 32 of each Mandarin tone throughout the experiment. Token order was randomized within each block. Each participant was assigned to one of two groups, which I will refer to as system 1 and system 2; each group had 12 participants. System 1 marks tones like the standard Pinyin system, with the exception of T1 being raised to match the FPS system (55). System 2 is the same as system 1 except for the visual representation of T3. In system 2, T3 is represented by a low falling tone, or 21 in the FPS system. The two systems have identical words and block design, and tones 1, 2, and 4 are represented in the same way. Only the visual representation of T3 differs between the two systems. See figure 5 for the visual representations for the two systems.
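One way to satisfy the block constraints described above is a Latin-square style rotation of tones over syllables, speakers, and blocks. The sketch below is an assumption about how such blocks could be built, not the author's actual script; `build_blocks` and the speaker labels are hypothetical names.

```python
import random
from collections import Counter

SYLLABLES = ["ba", "bi", "la", "li", "ma", "mi", "da", "di"]
SPEAKERS = ["S1", "S2", "S3", "S4"]  # hypothetical speaker labels
N_BLOCKS = 4


def build_blocks(seed=0):
    """Split the 128 (speaker, syllable, tone) tokens into four 32-trial blocks."""
    rng = random.Random(seed)
    blocks = []
    for b in range(N_BLOCKS):
        block = []
        for s, speaker in enumerate(SPEAKERS):
            for j, syllable in enumerate(SYLLABLES):
                # Rotating the tone by syllable, speaker, and block index gives
                # each speaker two tokens of every tone per block, and uses each
                # (speaker, syllable, tone) token exactly once across blocks.
                tone = (j + s + b) % 4 + 1
                block.append((speaker, syllable, tone))
        rng.shuffle(block)  # random token order within the block
        blocks.append(block)
    return blocks


blocks = build_blocks()
for i, block in enumerate(blocks, start=1):
    print(i, len(block), sorted(Counter(t for _, _, t in block).items()))
```

Counterbalancing the order of the four blocks across participants would be a separate step on top of this.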

The order of the blocks was counterbalanced to reduce the chances of block-specific problems being confounded with improvement in identification. On each trial, participants were asked to identify which visualization best matches the sound that is played using the “Q”, “P”, “Z”, “M” keys (see figure 5 for the keys next to each tone for the two systems). These keys are chosen to best match the position on screen. Each of the tones is consistently labeled with the same key. That is, T1 corresponds to “Q”, T2 corresponds to “P”, T3 corresponds to “Z”, and T4 corresponds to “M” (two examples for the two groups can be found in figure 5). After each trial, the participant was given individualized feedback with variable text depending on the correctness of their response. Feedback told them if their response was correct or incorrect, replayed the exact same sound file and showed the correct answer, as seen in figure 6. The participants were instructed to identify the tone they heard by listening to the word played and by using feedback from previous trials.
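The response mapping and trial feedback described above can be sketched as follows. The key-to-tone mapping comes from the text; the function name `score_response` and the feedback wording are assumptions for illustration, and the audio replay and on-screen display handled by the experiment software are omitted.

```python
# Response keys and the tone each one selects, as described in the text.
KEY_TO_TONE = {"q": 1, "p": 2, "z": 3, "m": 4}


def score_response(key: str, target_tone: int):
    """Return (is_correct, feedback_text) for one identification trial."""
    chosen = KEY_TO_TONE.get(key.lower())
    if chosen is None:
        raise ValueError(f"Unexpected response key: {key!r}")
    correct = chosen == target_tone
    # Hypothetical feedback wording; the source only says the text varied
    # with correctness and that the correct answer was shown.
    if correct:
        message = "Correct."
    else:
        message = f"Incorrect. The correct answer was T{target_tone}."
    return correct, message


print(score_response("Z", 3))
print(score_response("m", 3))
```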

To test the hypotheses formed from research questions 1, 2, and 3 stated above, a two-way ANOVA and a linear regression were used to better understand participants' identification of tones, with particular emphasis on T3. The hypotheses of research questions 1 and 2 were explored with a two-way ANOVA that uses accuracy as the dependent measure. Research question 3, which aims to examine overall learning through identification and learning rate, was explored using a multiple linear regression. Answer selection by tone and block is shown in figure 7. Statistical test results will be provided below.

Note. Blocks are organized from top to bottom (Block 1-Block 4) and tone trials of each tone are organized from left to right (T1, T2, T3, and T4).

For the two-way ANOVA, tone (four levels: T1, T2, T3, and T4; within-subject) and system (two levels: system 1 and system 2; between-subject) constituted the independent variables (IVs) and mean response accuracy by participant constituted the dependent variable (DV). Accuracy scores were calculated by dividing each participant's correct answers for a given tone by the total number of trials for that tone, which was always 32. There was a significant effect of tone (F(3, 376) = 5.69, p < 0.001, partial η² = .043) and system (F(1, 376) = 28.05, p < 0.001, partial η² = .069) with a significant interaction effect between tone and system (F(3, 376) = 2.68, p < 0.05, partial η² = 0.020). In order to explore the nature of the interaction, pairwise comparisons between the two systems were conducted for each of the four tones, with alpha levels adjusted for multiple comparisons (.05/4 = .0125). The output of these comparisons, summarized in Table 2, indicated significant differences between the two systems for both T2 and T3, but not for T1 and T4. A visualization of accuracy data is provided in figure 8.
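The two small calculations in this paragraph, per-tone accuracy and the adjusted alpha level, can be written out as a brief Python sketch. The function names are assumptions, and the example count of 16 correct answers is made up purely to show the arithmetic.

```python
TRIALS_PER_TONE = 32  # each participant heard 32 trials of every tone


def accuracy(n_correct: int, n_trials: int = TRIALS_PER_TONE) -> float:
    """Proportion of correct responses for one participant on one tone."""
    return n_correct / n_trials


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Alpha level adjusted for multiple pairwise comparisons."""
    return alpha / n_comparisons


print(accuracy(16))               # made-up example count, not study data
print(bonferroni_alpha(0.05, 4))  # one comparison per tone
```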


Note. The left side of each plot is a box plot that shows the first and third quartiles and the median. The right side is a violin plot, which shows the distribution of data. A point plot is added to show actual data points of participants.

T3 and T2 showed better accuracy in system 2; however, accuracy alone says nothing about the incorrect answers on each tone's trials. From figure 9, for T3 trials, it is clear that T4 is the primary distractor for system 1 participants but not for system 2 participants. In system 1, T4 and T3 were selected an average of 11.42 and 11.50 times per participant on T3 trials. In system 2, however, T4 and T3 were selected an average of 5.75 and 18.5 times per participant. This means that the T4 diacritic was selected just as often as T3 on T3 trials in system 1, but not in system 2. For T2 trials, T3 seems to be the primary distractor for participants in system 1 but not system 2. Participants in system 1 on average selected T3 and T2 9.58 and 13.75 times, respectively. For comparison, system 2 participants selected T3 an average of 4.33 times and T2 an average of 19.50 times. One of the contributing reasons for the accuracy improvement seems to be less confusion with other tones. However, the confusion is tone specific and not distributed equally across tone answers. Compared with system 2 participants, system 1 participants selected T4 more often on T3 trials. Similarly, on T2 trials, T3 was often chosen in T2’s place. However, this confusion for both T3 and T2 is unidirectional. In system 1, T4 is chosen on T3 trials and T3 is chosen on T2 trials, but the opposite is not true: T2 is not chosen equally often on T3 trials, and T3 is not chosen equally often on T4 trials.
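Because every tone had 32 trials per participant, the mean selection counts reported above can be read as proportions of trials. The following Python sketch does that conversion; the dictionary layout and the name `to_proportions` are assumptions, while the counts are the ones given in the text.

```python
TRIALS_PER_TONE = 32

# Mean number of selections per participant, as reported in the text.
mean_selections = {
    ("system 1", "T3 trials"): {"T3": 11.50, "T4": 11.42},
    ("system 2", "T3 trials"): {"T3": 18.50, "T4": 5.75},
    ("system 1", "T2 trials"): {"T2": 13.75, "T3": 9.58},
    ("system 2", "T2 trials"): {"T2": 19.50, "T3": 4.33},
}


def to_proportions(counts, n_trials=TRIALS_PER_TONE):
    """Convert mean selection counts into proportions of the trials for a tone."""
    return {response: round(n / n_trials, 3) for response, n in counts.items()}


for (system, trial_type), counts in mean_selections.items():
    print(system, trial_type, to_proportions(counts))
```

Read this way, the target and the main distractor are chosen at nearly the same rate on T3 trials in system 1, while the distractor rate drops by roughly half in system 2.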


To answer RQ3, a multiple linear regression was calculated to predict correctness based on system and trial number. For this analysis, I collapsed percent correct by trial over tones. Results of the multiple linear regression, lm(formula = percent correct ~ system * trial number), indicated that there was a significant effect for system and trial number (F(2, 252) = 26.3, p < .001, R² = .23). The individual predictors were examined further and indicated that system (t = 4.42, p < .001) and trial number (t = 3.185, p < .01) were significant predictors in the model. However, no significant interaction was found. Results are visualized in figure 10. The dotted line shows the chance performance that would be expected if participants were choosing completely at random. When the participants are examined as groups, both systems perform above chance from the very beginning. The significant effect of system confirms that learning T3 as a low tone improves overall tonal identification throughout the training session. The main effect of trial number shows that both groups improved identification over the training session. However, because there was no interaction between system and trial number, we cannot reject the null hypothesis that the rate of learning is the same between groups.
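The structure of the model percent correct ~ system * trial number can be illustrated with an ordinary least squares fit in NumPy. This is a sketch only: the data below are simulated, noise-free values invented to show how a system effect without an interaction appears in the coefficients, and the name `fit_interaction_model` is hypothetical. It does not reproduce the study's data or its reported statistics.

```python
import numpy as np

CHANCE = 1 / 4  # four response options, so random guessing gives 25% correct


def fit_interaction_model(system, trial, y):
    """OLS fit of y ~ system * trial.

    Returns coefficients in the order
    [intercept, system, trial, system:trial].
    """
    system = np.asarray(system, dtype=float)
    trial = np.asarray(trial, dtype=float)
    X = np.column_stack([np.ones_like(trial), system, trial, system * trial])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coefs


# Simulated illustration data (NOT the study's data): two systems, 128 trials each.
trial = np.tile(np.arange(1, 129), 2)
system = np.repeat([0, 1], 128)               # 0 = system 1, 1 = system 2
y = 0.40 + 0.15 * system + 0.001 * trial      # same slope in both groups

intercept, b_system, b_trial, b_interaction = fit_interaction_model(system, trial, y)
print(round(b_system, 3), round(b_trial, 4), round(b_interaction, 6))
```

A system coefficient above zero with an interaction coefficient near zero corresponds to two parallel learning curves: one group starts and stays higher, but both improve at the same rate.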

Discussion

Results of this experiment show that changing the diacritic of T3 affects not only identification of T3 itself but also improves identification of T2. Both groups performed better in the later trials than in earlier trials, which supports the efficacy of short-term tone training found in earlier studies (e.g., Bowles et al., 2016; Chandrasekaran et al., 2010; Chang & Bowles, 2015). Although participants using system 2 on average outperformed participants using system 1, rates of learning between systems were unaffected. Results of this experiment showed an overall improved identification of T3 for participants who were trained using system 2. This supports the first hypothesis: system 2 participants identified T3 sounds more accurately, choosing T3 more often on T3 trials. Further, T2 identification improved for system 2 participants, which means that the T2 visualization was chosen more often on T2 trials. This appears to be because of less confusion with T3. This study has reproduced the findings of previous studies for T2 confusion with T3 (e.g., Leather, 1983; Li & Thompson, 1977; Shen & Lin, 1991). However, this confusion is unidirectional. T3 is commonly selected on T2 trials, whereas T2 is the tone selected least often on T3 trials. Further, this confusion is remedied by making the visual contours more congruent with phonetic features. While system 1 participants showed confusion for T3 on T2 trials, system 2 participants did not confuse T3 for T2 any more often than T1 or T4. It is notable that both groups were unlikely to choose T2 on T3 trials. This is meaningful because previous literature has claimed that T3 is often confused for T2 and T4 (Kiriloff, 1969; Wang et al., 1999). While this is true for participants in system 1, who chose T4 on T3 trials, it is not true for either T2 or T4 with system 2. RQ2 was answered with the same two-way ANOVA as RQ1.
In the same way that T3 was analyzed, I analyzed the accuracy of T1, T2, and T4. For accuracy between systems, no difference was found for T1 or T4. However, T2 accuracy was shown to be statistically different between the two systems. This shows that improving identification of one tone can affect the identification of another tone. This can be confirmed by examining the incorrect answers on T2 trials: the accuracy improvement for T2 is mostly due to participants choosing T3 less often in system 2. As stated above, this means that system 1 participants reproduced the confusion of T3 and T2 found in several previous studies (e.g., Shen & Lin, 1991), while system 2 participants were better able to distinguish between the two tonal categories. This particular improvement is interesting for two reasons. 1) Many studies have found that T2 and T3 confusion is long term, even for very advanced L2 learners (e.g., Zhang, 2018). Improvement in any way for naïve beginners may lessen the burden of this problem in the long run for Mandarin L2 learners. 2) Within the regular Pinyin system, or system 1, T3 looks equally similar to T2 and T4. However, in system 2, T3 is much less visually similar to T2 but remains similar to T4. The contour direction within system 2 for T3 and T4 is the same. The lack of improvement in T4 accuracy may be due to a persisting similarity between the visual contours. Likewise, the improvement in T2 and T3 accuracy could be due to the visual distinctiveness of the two contours, which allows the more salient featural differences of T2 and T3 to be noticed rather than their similarities. Lastly, to answer RQ3, I performed a linear regression to see whether system and trial number predict the proportion of correct responses across participants. Like earlier studies, this study shows that short-term focused learning can improve tone learning (e.g., Chang & Bowles, 2015). Further, the analysis revealed effects of both system and trial number.
The results indicate that participants from both groups improved over the training session, and that system 2 participants performed better than system 1 participants throughout training. From this perspective, we can say that system 2 participants learned the tones more effectively than system 1 participants: they identified tones better at the start of training and improved to a higher overall level of identification of Mandarin tones. In this way, we can say that system 2 participants on average learned the categories of Mandarin tones better than participants in system 1. While there was a difference in tone identification over the training session, there was no meaningful difference between rates of learning. For this study, I interpret this the same way that Liu et al. (2011) interpret their results of better scores without faster rates. The rates of learning are the same between groups, meaning that participants in system 1 and system 2 both successfully improved over the training. However, the advantage of system 2 is purely one of perceptual ease in processing. For L2 learners, processing T3 as a low tone forms a more perceptually immediate link between visual and auditory information, which helps them by bootstrapping tone learning on top of a previously acquired link between visuospatial information and pitch information. This means that system 2 participants are not better at tone learning. Rather, system 2 participants are able to take advantage of knowledge that they already have to identify tones more accurately, specifically by confusing T2 and T3 less often from the start. Previous literature has repeatedly found that T2 and T3 show the most confusion in acquisition for both L1 children and L2 adults (Leather, 1983; Li & Thompson, 1977; Shen & Lin, 1991; Sun, 1998; Wiener, 2017; Zhang, 2018).
This study confirms that, at least for L2 adult learners, this process can be made less problematic by making the primary perceptual features of T2 and T3 more salient rather than their similarities.

General Discussion

This study attempts to understand how multimodal input affects the process of learning tone categories in a tonal language. This study starts by exploring Mandarin learners’ explicit knowledge of Mandarin tones and finds that the majority of participants did not have explicit knowledge pertaining to T3 lowness. Next, this study examined the effects of changing the visual representation of tones by manipulating the T3 diacritic to a low tone. Specifically, this study has found evidence for improved identification of Mandarin tonal categories when manipulating the tonal diacritic of Mandarin T3. This study makes no attempt to claim that T3 should be represented as a low tone descriptively; I only claim that the use of the low T3 visualization improves the learning of Mandarin tones for adults when compared to full T3. In more general terms of tone learning outside of Mandarin, this study, like many other multimodal tone learning studies, has shown that some ways of representing tone are more or less useful for learning tones (Liu et al., 2011; McGinnis, 1997). The particular implications of this would be language specific. However, what is clear is that the visual representation of a tone provides crucial information to the learner that allows them to learn tonal categories more or less effectively (Godfroid et al., 2017). One question that remains unanswered in the literature is whether visual representations would be effective for variation in tone. This is particularly important for the case of T3. While this study has found that Mandarin T3 is learned more effectively with the low T3 representation than with the full T3 representation, it is important to remember that this study was done for in-context T3s (T3s that were spoken in context).
That is, every sound was extracted from a carrier sentence. The actual T3 pronunciations have not been analyzed for lowness or fullness. It is unclear if representing a tone closer to the actual pitch contour would be equally helpful in a manner that is consistent with this study, meaning that it may not necessarily be better to have variation in the representation of a tone visualization. However, because this study uses in-context T3 for trials, it is not unreasonable to assume that most of the T3 sound files were not full T3s (Duanmu, 2007). In this respect, this study reconfirms the link between visuospatial information and pitch information found in studies of general perception (Casasanto et al., 2003). Like other recent studies looking at multimodal learning of tones (e.g., Baills et al., 2019; Hannah et al., 2017), this study has found that the manipulation of visual information improves the learning of tone, extending this relationship to the more abstract visual means of tonal diacritics. For the case of T3, earlier studies have consistently found high confusion between T2 and T3 (e.g., Sun, 1998). This study confirms this confusion between T2 and T3, but only for participants who learned T3 using the full “v” visual representation. In contrast, participants who learned T3 using the low T3 did not confuse T3 with T2 any more often than any other tone. Although, as stated, it may be best to represent T3 descriptively as a full T3, the findings of this study show that, while it is important to understand the features of a language descriptively, it is equally important to understand the starting points of learners. The tension between using Pinyin as a descriptive model of Mandarin tone and as an L2 learning aid has created an unnecessarily difficult task for the L2 learner. The persistent problems with acquiring T3 that earlier studies have found in adult L2 acquisition may be partially resolved by changing the visual representation of T3.
However, changes like this should only be made if future studies continue to replicate the results of this study in more longitudinal experiments that carefully examine the ways that students are learning T1, T2, and T4 alongside T3. While the low T3 of system 2 showed improved learning, it is possible, and even likely, that another representation could be even better (e.g., a low dot for T3 and a raised line for T1). It remains unclear which visual representation would be best for L2 learners. Specifically, it is still unclear whether the low T3 visualization of system 2 would help or hinder the learning of the full T3 variant. However, for tone learning, it is clear that visual representation matters.

Conclusion

This study has two primary findings: 1) that many current Mandarin L2 learners lack explicit knowledge of the low features of T3 and 2) that manipulating the visualization of a tone to be more phonetically representative of F0 height can improve the learning of tones by acting as a multimodal confirmation of auditory information. Moreover, T3 can be learned with either the “v” diacritic or the 21 low tone diacritic presented in this study. However, the T3 “v” representation makes it more difficult for learners to form an immediate connection between auditory information and visual information. In a practical situation, teaching T3 as a full tone is not detrimental to learning. However, teaching T3 as a low tone does improve the ease of learning T3 and affects the overall accuracy of tonal identification. System 2 participants performed better throughout the identification training task overall because of T3 and T2 identification improvements, but the rates of learning between systems remained the same. When considering tone learning, it is essential to examine the differences between a descriptive representation and an L2 learning aid.
While the description of a tone is crucially important for understanding the nature of tones in a particular language, we must not let these debates hinder the learning and potential successes of L2 learners.

References

Baills, F., Suárez-González, N., González-Fuente, S., & Prieto, P. (2019). Observing and producing pitch gestures facilitates the learning of Mandarin Chinese tones and words. Studies in Second Language Acquisition, 41(1), 33–58.
Boersma, P., & Weenink, D. (2021). Praat: doing phonetics by computer [Computer program]. Version 6.1.41, retrieved 25 March 2021 from http://www.praat.org/.
Bowles, A. R., Chang, C. B., & Karuzis, V. P. (2016). Pitch ability as an aptitude for tone learning. Language Learning, 66(4), 774–808.
Casasanto, D., Phillips, W., & Boroditsky, L. (2003). Do we think about music in terms of space? Metaphoric representation of musical pitch. Proceedings of the Annual Meeting of the Cognitive Science Society, 25(25), 1323.
Chandrasekaran, B., Sampath, P. D., & Wong, P. C. M. (2010). Individual variability in cue-weighting and lexical tone learning. The Journal of the Acoustical Society of America, 128(1), 456–465.
Chang, C. B., & Bowles, A. R. (2015). Context effects on second-language learning of tonal contrasts. The Journal of the Acoustical Society of America, 138(6), 3703–3716.
Chao, Y. R. (1930). A system of “tone-letters.” Le Maître Phonétique, 45, 24–27.
Chao, Y. R. (1948). Mandarin Primer. Cambridge: Harvard University Press.
Chao, Y. R. (1951). The Cantian idiolect: An analysis of the Chinese spoken by a twenty-eight-months-old child. University of California Publications in Semitic Philology, 1, 27–44.
Cintrón-Valentín, M. C., & Ellis, N. C. (2016). Salience in second language acquisition: Physical form, learner attention, and instructional focus. Frontiers in Psychology, 7(AUG), 1–21.
Duanmu, S. (2007). The Phonology of Standard Chinese (2nd Edition). New York, New York: Oxford University Press.
Gandour, J. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11(2), 149–175.
Godfroid, A., Lin, C. H., & Ryu, C. (2017). Hearing and seeing tone through color: an efficacy study of web-based, multimodal Chinese tone perception training. Language Learning, 67(4), 819–857.
Gussenhoven, C. (2004). The Phonology of Tone and Intonation. Cambridge, UK: Cambridge University Press.
Hannah, B., Wang, Y., Jongman, A., Sereno, J. A., Cao, J., & Nie, Y. (2017). Cross-modal association between auditory and visuospatial information in Mandarin tone perception in noise by native and non-native perceivers. Frontiers in Psychology, 8(DEC), 1–15.
Hyman, L. M. (2001). Privative tone in Bantu. In Gussenhoven, C. (2004). The Phonology of Tone and Intonation. Cambridge, UK: Cambridge University Press.
Kiriloff, C. (1969). On the auditory perception of tones in Mandarin. Phonetica, 20(2-4), 63–67.
Kong, Y.-Y., & Zeng, F.-G. (2006). Temporal and spectral cues in Mandarin tone recognition. The Journal of the Acoustical Society of America, 120(5), 2830–2840.
Leather, J. (1983). Speaker normalization in perception of lexical tone. Journal of Phonetics, 11, 373–382.
Liang, Z. A. (1963). The auditory perception of Mandarin tones. Acta Physica Sinica, 26, 85–91.
Lin, M. C. (1988). The acoustic characteristics and perceptual cues of tones in standard Chinese. Chinese Yuwen, 204, 182–193.
Lin, T., & Wang, L. (1992). Yuyinxue jiaocheng [A course in phonetics]. Beijing: Peking University Press.
Liu, Y., Wang, M., Perfetti, C. A., Brubaker, B., Wu, S., & MacWhinney, B. (2011). Learning a tonal language by attending to the tone: An in vivo experiment. Language Learning, 61(4), 1119–1141.
Mayer, R., & Moreno, R. (2002). Aids to computer-based multimedia learning. Learning and Instruction, 12, 107–119.
McGinnis, S. (1997). Tonal spelling versus diacritics for teaching pronunciation of Mandarin Chinese. The Modern Language Journal, 81(2), 228.
Mitchel, A. D., & Weiss, D. J. (2011). Learning across senses: Cross-modal effects in multisensory statistical learning. Journal of Experimental Psychology: Learning Memory and Cognition, 37(5), 1081–1091.
Perrachione, T. K., Lee, J., Ha, L. Y. Y., & Wong, P. C. M. (2011). Learning a novel phonological contrast depends on interactions between individual differences and training paradigm design. The Journal of the Acoustical Society of America, 130(1), 461–472.
Peirce, J. W., Gray, J. R., Simpson, S., MacAskill, M. R., Höchenberger, R., Sogo, H., Kastman, E., Lindeløv, J. (2019). PsychoPy2: experiments in behavior made easy. Behavior Research Methods. 10.3758/s13428-018-01193-y.
Rusconi, E., Kwan, B., Giordano, B. L., Umiltà, C., & Butterworth, B. (2006). Spatial representation of pitch height: The SMARC effect. Cognition, 99(2), 113–129.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.
Shi, P. W., & Li, M. (1997). “Sansheng wenti yanjiu” [On the Third Tone]. In Language Studies and Teaching Chinese as a Foreign Language, edited by J. M. Zhao, 125–54. Beijing: Beijing Language and Culture University Press.
Shen, J., Deutsch, D., & Rayner, K. (2013). On-line perception of Mandarin tones 2 and 3: Evidence from eye movements. The Journal of the Acoustical Society of America, 133(5), 3016–3029.
Shen, X. S., & Lin, M. (1991). A perceptual study of Mandarin tones 2 and 3. Language and Speech, 34, 145–156.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26(2), 212–215.
Sun, S. H. (1998). The development of a lexical tone phonology in American adult learners of standard Mandarin Chinese. Honolulu, HI: University of Hawai’i Press.
Suo, F. (2019). Hanyu shangsheng jiaoxue shiyan [An experiment on third tone teaching]. Unpublished Master’s thesis. Shanghai Foreign Language University.
Thomson, R. I. (2012). Improving L2 listeners’ perception of English vowels: A computer mediated approach. Language Learning, 62, 1231–1258.
Tsung, C. (1987). Half-third first: on the nature of the third tone. Journal of the Chinese Language Teachers Association, 22(1), 87–101.
van Rossum, G. (1995). Python tutorial, Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, May 1995.
Wang, Y. (2015). Mianxiang duiwaihanyu jiaoxuede Putonghua shangsheng yuyin yanjiu [Teaching third tone to L2 learners]. Unpublished Master’s thesis. University of Anhui.
Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. Journal of the Acoustical Society of America, 106, 3649–3658.
Wiener, S. (2017). Changes in early L2 cue-weighting of non-native speech: Evidence from learners of Mandarin Chinese. In Interspeech (pp. 1765–1769). Stockholm, Sweden: International Speech Communication Association (ISCA).
Wiener, S., Ito, K., & Speer, S. R. (2018). Early L2 spoken word recognition combines input-based and knowledge-based processing. Language and Speech, 61, 632–656.
Whalen, D. H., & Xu, Y. (1992). Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica, 49(1), 25–47.
Wong, P. C. M., & Perrachione, T. K. (2007). Learning pitch patterns in lexical identification by native English-speaking adults. Applied Psycholinguistics, 28(4), 565–585.
Xu, Y. (1997). Contextual tonal variations in Mandarin. Journal of Phonetics, 25, 61–83.
Xu, Y. (2015). Intonation in Chinese. In W. S.-Y. Wang & C. Sun (Eds.), The Oxford Handbook of Chinese Linguistics (pp. 490–502). New York, NY: Oxford University Press.
Yip, M. (2002). Tone. Cambridge, UK: Cambridge University Press.
Zhang, H. (2018). Second Language Acquisition of Chinese Tones—Beyond First-Language Transfer. Boston, MA: Brill.
Zhou, Y. (1995). Hanyu Pinyin jichu fangan zhishi [Basics of Hanyupinyin]. Beijing: Yuwen Audio and Video Publishing House.
Zhu, X., & Wang, C. (2015). Tone. In The Oxford Handbook of Chinese Linguistics (pp. 503–515). New York, NY: Oxford University Press.


Appendix

Language background results: A two-sample t-test was run to compare the amount of language spoken/learned by participants across the two groups. No significant difference was found (t(22)=-0.92, p=0.37).

Musical training results: A Mann-Whitney U test was run to compare the participants’ musical training across groups. No significant difference was found (W=65, p=0.97).

Age results: A Mann-Whitney U test was run to compare the participants’ ages across groups. No significant difference was found (W=69.5, p=0.85).

Vocal training results: A Mann-Whitney U test was run to compare the participants’ vocal training across groups. No significant difference was found (W=85, p=0.27).
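For readers unfamiliar with the Mann-Whitney U test, the following minimal sketch shows how the test statistic can be computed by counting pairs. The function name `mann_whitney_u` and the group values are hypothetical and are not the participant data of this study. The p-value is left out, since it would normally be obtained from a statistics package.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x_i, y_j) pairs with x_i > y_j,
    counting each tied pair as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical years of musical training for two groups (NOT the study's data).
group_a = [0, 2, 5, 1, 3]
group_b = [1, 4, 0, 6, 2]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(f"U for group A: {u_a}, U for group B: {u_b}")
```

The two statistics always sum to the product of the group sizes, so a value near half of that product, as in this example, indicates little separation between the groups.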




