Exam in Musical communication and Music Technology

1.

a. In the article “Emotive transforms” (2000) Johan Sundberg discusses emotional expressivity in music. By studying deviations from the nominal score, he has identified two basic principles: grouping and differentiation. In expressive singing, both of these characteristics were emphasized. Moreover, semantically important words are marked more clearly. Both features can also be found in speech. Further, musical gestures are typically “terminated with a micropause, and subphrases and phrases are initiated by an accelerando from a slow tempo and terminated with a rallentando (a slow down of tempo)” (Sundberg, 2000). This kind of hierarchical grouping and phrasing is another common feature. Hence, it seems that speech and singing have much in common. Still, there are also some differences worth mentioning.

As pointed out in the article “Human singing voice” (Sundberg, 1997), the voices of bass, tenor and alto singers are characterized by an extraordinarily high spectrum envelope peak roughly between 2 and 3.5 kHz, the most sensitive frequency range of the ear. This peak is usually called the singer’s formant and occurs due to clustering of the higher formants (F3, F4, and F5). The singer’s formant makes the timbre of the voice more distinct. The lower formants are also affected: in front vowels ([i:] and [e:]), F2 is significantly lower in singing than in (neutral) speech.

Scherer (1995) discusses the relationship between emotional expressions in singing and music in his article “Expression of emotion in voice and music”. In the article he also considers the relationship between the two rather different oral signal systems, speech and singing. In speech, changes in “fundamental frequency, formant structure and of the glottal source spectrum” are used to communicate syntax, semantics, prosody and emotional state as well as information about the speaker (e.g. age, gender, origin). In the same way, melody, harmony, temporal patterning, rhythm, and articulation are used to communicate emotions through music. There are, however, some fundamental differences between speech and singing (vocal music). Studying and interpreting singing is usually more complicated than studying speech, according to Scherer. Contrary to speech, which is “used systematically for interpersonal communication” (Scherer, 1995, page 242), most singing activities usually involve artistic ambitions to some extent. Hence, the artistic aspects of the singing performance may limit the range and intensity of the emotions experienced by the singer during the performance. Furthermore, the emotional expression in singing is usually defined and constrained by the composer’s artistic intentions (the musical score). In other words, singing is a second-hand form of expression in the sense that the singer is interpreting the emotional intentions of the composer, rather than expressing his or her own emotions.

It should be pointed out that all the factors mentioned above (the emotional state of the speaker or singer, the characteristics of the emotional cues, and the speaker’s or singer’s interpretation and utilization of these cues) affect both speech and singing, yet in different ways and to a different extent (Scherer, 1995).

Some of the differences between speech and singing derive from differences in aims and functions. While the main objective of speech is intelligibility, singing is usually more focused on expressing and inducing emotions, similarly to other artistic modalities. Therefore, singing is normally more exaggerated than speech, with greater pitch variability and a wider dynamic range. Another important difference between speech and singing is structure. Singing is usually based on some kind of musical score that sets the conditions and limits for the performance. Hence, especially in the case of professional singers, the singing performance is normally more ordered, with a well-controlled tempo and rhythm as well as a clear, almost exaggerated, articulation and emphasis of important words. Still, it should be pointed out that, especially in expressive singing, deviations from the nominal durations and tempo variations are quite commonly used to emphasize the emotional characteristics of the song (Sundberg, 2000).

In the article “Speaking vowels versus singing vowels” Titze (YYYY) mentions three features that distinguish singing from speech: (1) singing, especially during a performance, requires a wider dynamic range to produce a louder sound (megaphone effect); (2) a singer has to balance loudness across the phonemes, especially at high pitches where the formants are not equally loud; (3) techniques for achieving particular vocal qualities, e.g. nasalization, may affect other aspects of the sound and therefore require some balancing. This points to another important aspect of singing: there is a great variety of singing techniques, ranging from speech-like singing (as in musicals) to vocal music based purely on sounds. Hence, it is rather difficult to generalize about the differences between speech and singing.

In their article “Expression, Perception, and Induction of Musical Emotions” (2004) Juslin and Laukka point out that one of the most crucial goals for future research within the field of emotional communication in music is synthesis. The authors note that many of the techniques used in the synthesis of emotions in speech could be used in the synthesis of emotions in music. In other words, music seems to be related to speech to some extent. Since singing (vocal music) and music share many features, these similarities most likely concern speech and singing as well.

Further, Juslin and Laukka point out the distinction between composer-related features, such as mode and consonance/dissonance, and performer-related features, such as tempo, loudness, and timbre, as an important aspect in the context of comparing music and speech. Many of the performer-related features can also be identified in speech. Studies show that performers primarily use “the same emotion-specific patterns of acoustic parameters that are used in emotional speech” (Juslin & Laukka, 2004, page 221). In addition, speech prosody can help to explain some of the emotional features of melodies, such as grouping and differentiation. By contrast, notation-related features such as harmony, tonality, and melodic progression seem to be related to more individual artistic aspects and therefore vary between cultures.

b. In the article “Expressiveness of a marimba player’s body movements” (2004) Dahl and Friberg discuss the relationship between music and body movements. In the introduction to the article the authors refer to a study of speech production conducted by McNeill et al. (2002) in which the results indicate an intimate relationship between speech and body gestures. According to this study, movement gestures and speech have the same semantic origin; they are “co-expressive” rather than subordinate to each other (Dahl & Friberg, 2004). Further, the authors refer to a study carried out by Juslin and Laukka (2003) showing that speech and music have many common characteristics. Similarly, Scherer (1995) points out that “all expressive modalities, particularly body posture, facial features, and vocalization, are involved in emotional communication”. In other words, it seems that all of these expressive modalities are intimately intertwined and serve specific communication functions. They complement and reinforce each other in order to provide more robust emotional cues.

One basic condition for being able to compare emotional communication in speech, music and body movements is identifying the most important cues in each category. Juslin and Laukka (2003) have compiled a set of musical features. According to this compilation, a happy performance (positive valence and high activity) is among other things characterized by a fast mean tempo, small tempo variability, high sound level and pitch, staccato articulation, large articulation variability, and fast tone attacks, while sad music (negative valence and low activity) is characterized by slow mean tempo, small tempo variability, low sound level and pitch, legato articulation, small articulation variability, and slow tone attacks. In a similar way, Boone and Cunningham (2001) have identified a number of central cues for body movements: frequency of upward arm movement, the amount of time the arms were kept close to the body, the amount of muscle tension, the amount of time an individual leaned forward, the number of directional changes in face and torso, and the number of tempo changes an individual made in a given action sequence. Anger and happiness are usually characterized by large movements. Variations in tempo are accordingly related to variations in gesture speed. In addition, anger is usually faster and jerkier (less smooth) than happiness. A sad performance, however, is characterized by relatively small, slow, smooth and regular movements, whereas fearful movements tend to be small, fast, and jerky. (Dahl & Friberg, 2004)

According to Dahl and Friberg there is a direct and intimate relationship between features in musical performance and body movements: (1) tempo in musical performance can be compared to gesture rate (speed), (2) sound level to gesture size (amount), (3) staccato articulation to fast gestures with a resting part (fluency), and (4) tone attack to initial gesture speed. (5) Movement regularity can be compared to the general tempo variations, while (6) variations in articulation can be associated with variations in gesture rate and the duration of the resting parts.

In a study conducted by Scherer (1995), the roles of fundamental frequency (contour variables and range), intensity, duration, and accent structure in vocal communication were examined. The results of the study indicate strong direct effects of all of these variables. Still, the fundamental frequency seemed to be the strongest cue. According to the study, a narrow fundamental frequency range indicated sadness or neutrality, while a wide range expressed high negative arousal, such as annoyance or anger. Further, high intensity indicated negative affect and aggression. A fast tempo (short voiced segment duration) was correlated with joy, while a slow tempo (long duration) was correlated with sadness. Anger is usually expressed by an increase in mean fundamental frequency and mean intensity. In addition, increases in high-frequency energy, downward-directed fundamental frequency contours, and a relatively high articulation rate seem to communicate angry emotions. In sadness, the mean fundamental frequency, the frequency range and the intensity all decrease. Scherer also refers to another study, conducted by Kotlyar and Morozov. According to this study there are similarities between emotional expressions in singing and speech: “fast tempo for fear, slow for sorrow, high vocal energy for anger”. There are, however, some differences, such as “longer pauses between syllables for fear compared with other emotions” (Scherer, 1995).

One interesting aspect of the different communication channels discussed above is the variation in “dominant” emotions, i.e. the emotions that are easiest to express and induce in others. When it comes to speech, studies presented by Scherer (1995) indicate that sadness and anger are easiest to convey, whereas more complex emotions such as fear and disgust are significantly more difficult to express (Scherer, 1995, page 237). Concerning music, happiness and sadness seem to be more straightforward than other emotions, while jealousy is quite complex (Juslin & Laukka, 2004). Finally, in dance, fear seems to be the most difficult emotion to express, while sadness, happiness and anger were successfully conveyed (Dahl & Friberg, 2004, page 83).

2. In the article “Time discrimination in a monotonic, isochronous sequence” (1995) Friberg and Sundberg apply the concept of the just noticeable difference (jnd), i.e. the smallest noticeable change, to displacements of time markers in monotonic, isochronous music sequences. The jnd values are presented in two forms: absolute jnd [ms] and relative jnd [% of IOI]. After evaluating the results of a number of similar studies, the authors come to the conclusion that the jnd for a perturbation in time depends on several factors, such as the interonset interval (IOI), the number of time intervals included in the sequence, and the type of perturbation.

The two music examples presented in this task can both be considered very fast in tempo. Still, they have different characteristics, partly due to differences in the applied perturbations. The jnd values used in this section all originate from the article by Friberg and Sundberg (1995) mentioned above.

a) 170 bpm (quarter notes per minute); cyclic displacement

Background

In this musical score the displacement is cyclic, i.e. tones in stressed positions are repeatedly lengthened “at the expense of the subsequent tone in unstressed position” (Friberg & Sundberg, 1995).

Calculations

170 bpm = 170 quarter notes/minute = (170 * 2 =) 340 eighth notes/minute = (340/60 ≈) 5.67 eighth notes/second, i.e. 1/5.67 ≈ 0.176 seconds/eighth note ≈ 176 ms/eighth note

Conclusions

In other words, the interonset interval (IOI) for this first musical piece is approximately 176 milliseconds. With a relative jnd of approximately 8.0 % of the IOI (Friberg & Sundberg, 1995), this corresponds to an absolute jnd of about (0.08 * 176 ≈) 14 milliseconds. Hence, a deviation of ± 40 ms, as illustrated by the DM chart, should be clearly noticeable and effective (striking). I do not recognize the piece, but based on the regular deviation pattern it could be a piece with a “jazzy” or “swingy” character.
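As a minimal sketch of this arithmetic (and of the corresponding calculation in part b below), the conversion from tempo to IOI and the comparison against the jnd can be expressed as follows. The function names are hypothetical, and the relative jnd percentages are simply the values assumed in this answer, not re-derived from the article.

```python
def ioi_ms(bpm: float, subdivisions_per_beat: int) -> float:
    """Inter-onset interval (ms) for notes dividing each beat into subdivisions_per_beat."""
    return 60_000.0 / (bpm * subdivisions_per_beat)

def exceeds_jnd(deviation_ms: float, ioi: float, relative_jnd: float) -> bool:
    """True if a timing deviation exceeds the assumed jnd (given as a fraction of the IOI)."""
    return deviation_ms > relative_jnd * ioi

# Part a: 170 bpm, eighth notes, cyclic displacement of +/- 40 ms, assumed jnd = 8 % of IOI
ioi_a = ioi_ms(170, 2)                      # ~176 ms
print(ioi_a, exceeds_jnd(40, ioi_a, 0.08))  # 40 ms is well above the ~14 ms jnd

# Part b: 150 bpm, sixteenth notes, lengthening of 17 ms, assumed jnd = 10 % of IOI
ioi_b = ioi_ms(150, 4)                      # 100 ms
print(ioi_b, exceeds_jnd(17, ioi_b, 0.10))  # 17 ms is only marginally above the 10 ms jnd
```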

b) 150 bpm (quarter notes per minute); lengthening

Background

In this case a lengthening has been applied to the deadpan score. Lengthening means that one IOI is lengthened, with the result that the following time markers arrive correspondingly late. Hence, the entire subsequent tone sequence is shifted later in time. (Friberg & Sundberg, 1995)

Calculations

150 bpm = 150 quarter notes/minute = (150 * 4 =) 600 sixteenth notes/minute = (600/60 =) 10.0 sixteenth notes/second = 0.1 seconds/sixteenth note = 100 ms/sixteenth note

Conclusions

In other words, the interonset interval (IOI) for this second musical piece is 100 milliseconds. With a relative jnd of approximately 10 % of the IOI (Friberg & Sundberg, 1995), this corresponds to an absolute jnd of (0.1 * 100 =) 10 milliseconds. Hence, a deviation of 17 ms, as illustrated by the DM chart, is barely noticeable, especially since it occurs at the end of a subsequence (just before the melody goes up), where we as listeners “expect” a delay (DiFilippo & Greenebaum, XXXX, page 81).

I think the piece is “Flight of the Bumblebee”, composed by the Russian composer Nikolay Rimsky-Korsakov.

3.

a) In the emotional space presented by Juslin (2001) emotions are mapped using two prominent, bipolar dimensions of emotion: valence (positive-negative) and activity or arousal (high-low). According to this categorization principle, happiness has a positive valence and a high activity, while sadness has a negative valence and a low activity. In the same way, tenderness has a positive valence and a low activity, whereas anger has a negative valence and a high activity. Fear has a slightly lower activity and a more negative valence than anger.

In the case of the “Pong” computer game, a hit should be considered a positive event, while a miss should be considered a negative event. Therefore, I have chosen to enhance these events with a happy sound for “hit” and a sad sound for “miss”. In order to implement the sounds, I used parts of the PD patch “Additive”, which combines eleven different sinusoidal partials into a bell-like signal. In addition, the patch allows the user to adjust frequency, delay and output sound level. The partials are processed in the abstraction “partials” together with the values for frequency and delay entered by the user. Thereafter, the final sum is thrown back to the original patch to be output. The processing in the abstraction “partials” is triggered by the two bang objects in the main patch. The bang objects trigger the “s trigger” object to send a signal to the bang object in “partials”.

The characteristics of the two sound effects implemented in the game are based on research results on emotional expression in music (Juslin, 2001). According to these results, happy music is characterized by a high pitch (frequency) and sound level as well as a fast mean tempo and a bright timbre. Sad music, on the contrary, is characterized by a low pitch and sound level, usually in combination with a slow mean tempo and a dull timbre. I have chosen to focus on pitch and tempo (duration) in this particular case. As seen in the attached screenshot, the sound effect for “hit” has a higher pitch and shorter duration (100 Hz, 30 ms) than the corresponding sound effect for “miss” (50 Hz, 40 ms).
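Since the PD patch itself is graphical and cannot be reproduced here, the sketch below illustrates the same additive-synthesis idea in Python. The partial amplitudes, the decay envelope and the way the “delay” parameter is mapped onto the partial onsets are assumptions for illustration, not a transcription of the actual patch.

```python
import numpy as np

SR = 44100  # sample rate in Hz

def bell(base_freq: float, partial_delay_ms: float, n_partials: int = 11,
         duration_s: float = 1.0) -> np.ndarray:
    """Sum of n_partials decaying sinusoids; each partial starts partial_delay_ms
    later than the previous one, giving a rough, bell-like attack (assumed mapping)."""
    t = np.arange(int(SR * duration_s)) / SR
    signal = np.zeros_like(t)
    for k in range(1, n_partials + 1):
        onset = (k - 1) * partial_delay_ms / 1000.0
        active = t >= onset                                  # partial is silent before its onset
        envelope = np.exp(-4.0 * (t - onset)) * active       # simple exponential decay
        signal += (1.0 / k) * envelope * np.sin(2 * np.pi * k * base_freq * (t - onset))
    return signal / np.max(np.abs(signal))

hit_sound = bell(100.0, 30.0)   # "hit": higher base frequency, shorter partial delay
miss_sound = bell(50.0, 40.0)   # "miss": lower base frequency, longer partial delay
```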

b)

According to Juslin and Laukka (2004), “both the mean level of a cue and its variability throughout the performance” may be important for the emotional communication process. Research results indicate that tempo, sound level and articulation are the most important cues for the activity level, whereas timbre in combination with sound level influences the emotional valence. Extreme values, i.e. a very high or very low timbre or sound level, usually give the listener a negative impression. These results suggest that both aspects (mean value and variability) should be implemented in the dynamic soundtrack in order to give the user a robust and reliable emotional impression. However, since calculating mean values in PD is a rather complex procedure, I have chosen to disregard this possibility.

In order to map the total score, the game activity and sudden large jumps onto the two-dimensional space of valence and activity, I have implemented a number of basic math objects in the PD patch. The total score is directly connected to valence: a high total score gives a positive valence, whereas a low total score gives a negative valence. The “sudden large jumps”, which influence the activity, are represented by the values added to or subtracted from the total score. A high score in either box (points added or points subtracted) results in a high activity value.
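A minimal sketch of this mapping logic, written in Python rather than as PD math objects; the function name and the normalization constant are hypothetical and only illustrate the direction of the mapping described above.

```python
def emotion_coordinates(total_score: int, points_changed: int,
                        score_range: int = 100) -> tuple[float, float]:
    """Map game state onto the valence/activity plane.

    total_score    -> valence in [-1, 1]: high score = positive, low score = negative
    points_changed -> activity in [0, 1]: large sudden additions/subtractions = high activity
    score_range is a hypothetical normalization constant.
    """
    valence = max(-1.0, min(1.0, total_score / score_range))
    activity = min(1.0, abs(points_changed) / score_range)
    return valence, activity

# Example: a total score of 40 after a sudden jump of 25 points
print(emotion_coordinates(40, 25))   # -> (0.4, 0.25): mildly positive valence, moderate activity
```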

One crucial aspect that I have not managed to implement in the game patch is the overall game tempo (activity). The tempo is controlled by the subpatch “randomgamer” and is currently constant. In the case of a varying tempo, a possible enhancement would be to map the tempo to the activity outlet. In that case, a slow tempo would correspond to a low activity.

4.

Background

Nowadays it is possible to simulate synthesizers relatively accurately in software. One crucial factor concerning these simulators is the system latency, i.e. the time between pressing a key on a keyboard and the sound being output by the computer. How much can the sound be delayed without spoiling the feeling or “illusion” of playing a musical instrument in real time, both from the player’s and the audience’s point of view?

The concept of system latency is discussed in the article “Perceivable Auditory Latencies” written by DiFilippo and Greenebaum (2004?). The authors define the term system latency as “the time between human input to a system […] and system output”. According to DiFilippo and Greenebaum, a user of a computer game can tolerate up to 30 milliseconds between, for instance, pushing a button and hearing the result (page 66). Hence, a system designer can make the system more robust and efficient by meeting the known values of human perceptual resolution in hearing, seeing and touching. The values for tolerable latency depend on several factors, such as stimulus type, individual subject and experimental method. DiFilippo and Greenebaum mention three main stimulus types: clicks, gaps and duration. This report focuses particularly on the temporal separation between audio and touch events in the context of music performance.

In a study conducted by Mathews and Levitin (2000), in which subjects struck a surface with a baton, audio and touch events ceased to be perceived as simultaneous when there was more than 25 ms of temporal separation for audio leading touch and more than 66 ms of temporal separation for touch leading audio. It should be pointed out that the subjects were blindfolded during the experiment. The corresponding values for subjects observing the experiment on video were 66 ms of temporal separation for video leading audio and more than 42 ms for audio leading video. Thus, there seems to be no clear difference between seeing and feeling an event before hearing it.

In a study presented in Diana Deutsch’s book The Psychology of Music (second edition), a group of pianists was given delayed auditory feedback through headphones while playing. The experiment shows that delayed feedback has less disruptive consequences for piano music than for human speech. This result is consistent with the known fact that experienced pipe organ players manage to compensate for the relatively long delay caused by the pipes and sound propagation.

To sum up, an acceptable latency of approximately 60 milliseconds is expected, depending on the settings and conditions of the experiment.

Method

A. Subjects

In order to examine how much latency is acceptable for a keyboard player using a software synthesizer, 30 professional musicians (15 female and 15 male) of varying ages (25-75 years) and musical skills were invited to participate in the test.

B. Procedure

The test consisted of two parts, of which the first one is based on latency between audio and touch (latency acceptance amongst keyboard players) and the second one on vision, audio and touch (latency acceptance amongst observers). Consequently, the subjects were divided into two groups of 15 persons each. The first group consisted only of piano players, while the second group consisted of mixed musicians.

The task of the group participating in the first part of the test was to play the first phrase of the Swedish song “Ekorrn” on a Yamaha synthesizer keyboard connected to a Power Mac G5. The stimuli were presented to the subjects over a pair of loudspeakers in a damped studio. The latency was controlled directly by the computer using PD. Due to the rather low number of simultaneous events, the innate timing accuracy of the experimental setup was within ± 0.5 milliseconds. All tests were filmed with a high-quality digital video camera.

In order to measure the latency, microphones connected to a recorder were placed both next to the synthesizer and next to the loudspeakers. The latency was then estimated from the time difference between the sound produced by the keyboard when a key was pressed and the subsequent sound produced by the loudspeakers.
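A sketch of how this latency estimate could be computed from the two microphone recordings, assuming they are available as NumPy arrays recorded at the same sample rate; cross-correlation is used here as one plausible technique, not a procedure prescribed by the experimental design above.

```python
import numpy as np

def estimate_latency_ms(key_mic: np.ndarray, speaker_mic: np.ndarray,
                        sample_rate: int = 44100) -> float:
    """Estimate how much the loudspeaker signal lags the key-press signal,
    using the lag that maximizes their cross-correlation."""
    corr = np.correlate(speaker_mic, key_mic, mode="full")
    lag_samples = int(np.argmax(corr)) - (len(key_mic) - 1)
    return 1000.0 * lag_samples / sample_rate
```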

The test consisted of ten pre-determined delay values (steps) with a delay of either 10, 30, 40, 50, 60, 70, 100, 300, 600, or 1000 ms. Since a value of about 60 ms was expected, the test values were mainly concentrated around this value. Each step was repeated three times; thus, the test contained a total of (10 * 3 =) 30 stimulus presentations. For each stimulus presentation the subjects were asked to estimate whether the audio and touch events were simultaneous. There were no particular instructions on how to listen to the sound or how to form this judgement.

After the first part of the test had been conducted, the second group of subjects took part in an observation study based on the video material from the first part. Each subject was asked to watch the video clips of the keyboard players while listening to the delayed sound. In accordance with the first test, the subjects were then asked to estimate whether the audio and touch events were simultaneous.

Analysis

For each subject, the suggested limit for acceptable latency is identified, i.e. the largest tested delay for which the audio and touch events are still judged to be simultaneous in a majority of the three repetitions.
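A sketch of one way this per-subject limit could be computed from the recorded judgements; the data layout (a mapping from delay value to the three yes/no simultaneity judgements) is an assumption about how the responses would be stored, not part of the experimental design.

```python
from typing import Dict, List

def acceptable_latency_ms(judgements: Dict[int, List[bool]]) -> int:
    """Largest tested delay (ms) that the subject judged as simultaneous
    in a majority of its repetitions; 0 if no delay qualified."""
    accepted = [delay for delay, answers in judgements.items()
                if sum(answers) > len(answers) / 2]
    return max(accepted, default=0)

# Hypothetical responses for one subject (three repetitions per delay step)
subject = {10: [True, True, True], 30: [True, True, False], 40: [True, False, False],
           50: [False, True, True], 60: [False, False, True], 70: [False, False, False]}
print(acceptable_latency_ms(subject))   # -> 50
```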
