VON Talk - Recursos VoIP



FACTORS IN THE SUCCESS OF VOICE QUALITY IN CONVERGING TELEPHONY AND IP NETWORKS

Network managers can track the various technology factors that now interfere with delivering superior VoIP services

By Stefan Pracht

The Internet’s rapid growth over the last 10 years has pioneered new communications media such as email, teleconferencing, and voice over Internet protocol (VoIP). But despite increasingly universal integration, the Internet’s audio and visual transport quality remains below the benchmark set by television and the plain old telephone system (POTS). This paper focuses on the critical aspects of voice quality in a converged telephony and IP network: the influencing factors and network impairments that affect transmission quality, their causes, and what can be done to improve the quality of the voice connection.

Traditional telephony networks are built to provide optimal service for time-sensitive voice applications that require low delay and jitter. Telephone networks provide constant but low-bandwidth service. Internet protocol (IP) networks, by contrast, are built to support non-real-time applications (file transfers or email) that are characterized by bursty traffic with occasional high bandwidth demand and longer delays.

Converging telephony and IP networks demands that IP networks be enhanced with mechanisms that ensure the quality of service (QoS) required to carry VoIP. High QoS is especially important because users of traditional telephony networks are accustomed to high voice quality. Providing service quality comparable with that of traditional telephone networks will drive the initial acceptance and success of VoIP services. The following table lists the primary factors that influence user perception of phone service quality.

|Service Quality |Voice Quality: Traditional PSTN |Voice Quality: In Addition in IP Networks |
|Offered services – calling card, 1-800/900 services, follow-me, voicemail, etc. |Loudness |Delay |
|Reachability of users in other countries or regions |Delay |Delay-Jitter |
|Availability – down time, busy signals |Echo |Clarity |
|Reliability – e.g., dropped calls, wrong number |Clarity |Packet Loss |
|Price |Intelligibility |Bandwidth Availability |
| |Noise |Compression |
| |Fading | |
| |Cross Talk | |

Table 1: User Quality Perception - the Influencing Factors

Service quality describes the features of the offered service, but not its value, which is determined by the demands of individual user groups. For example, frequent travelers might value calling cards and follow-me features more highly than families with teenage children, who may see greater value in multiple mailboxes for the same phone number.

1.1 Clarity

Clarity can be described as speech intelligibility, indicating how much information can be extracted from a conversation. Speech intelligibility depends on a large variety of factors, only a few of which are well understood. For example, certain frequency bands are more important for intelligibility than others: 250-800Hz is less important for speech intelligibility than 1000-1200Hz. Intelligibility also depends on speech content. Complete sentences are usually much better understood than a sequence of unrelated words because of the logical word flow in a sentence. Figure 1 illustrates the influencing factors for clarity in a combined IP/telephony network. The figure shows a typical VoIP connection between a phone connected to the public switched telephone network (PSTN) and a VoIP terminal connected to an IP network.

The different network components all have an impact on voice clarity:

• The PSTN phone influences clarity through the quality of the loudspeaker and microphone, loudness, and the acoustic echo generated between the loudspeaker and microphone.

• The PSTN network uses digital voice transmission for greater efficiency in the backbone. This requires digitizing the analog voice signal, which affects clarity.

• The VoIP Gateway connects the PSTN with the IP network and adapts voice and signaling schemes. Gateway components such as the speech codec, silence suppression mechanism, and comfort noise generator affect clarity.

• In addition, the IP network, even without active voice components, affects clarity through its tendency to lose packets and to add extensive jitter and delay to the signal.

• The H.323 PC terminal also affects clarity through its speech codec, silence suppression mechanism, and microphone and loudspeaker quality.

1.1.1 Packet loss

Packet loss is not uncommon in IP networks. As the network, or even some of its links, becomes congested, router buffers fill up and routers start to drop packets. Another cause can be route changes due to network links going down. An effect similar to packet loss occurs when a packet experiences an extended delay in the network and arrives too late to be used to reconstruct the voice signal. For non-real-time applications, such as file transfers, packet loss is not critical because the protocol allows retransmission to recover dropped packets. However, real-time voice information has to arrive within a certain time window to be useful for reconstructing the voice signal. Retransmission would add extensive delay to the reconstruction and would cause clipping or unintelligible speech.

To avoid packet loss for real-time applications, mechanisms are required in the IP network to assure minimum throughput for selected applications. These mechanisms minimize packet loss, as well as delay, for higher-priority traffic such as voice. Different router mechanisms can be used to meet this objective. These include various prioritization schemes, such as weighted fair queuing (WFQ), and router flow control mechanisms, such as the Internet Engineering Task Force’s (IETF) multi-protocol label switching (MPLS) tagging scheme or use of the type of service (ToS) bits in the IP header. All these mechanisms require prior configuration by a network administrator, who must decide what priority and resources to provide for each specific service class. A more dynamic alternative for assigning resources is the resource reservation protocol (RSVP), which permits a voice terminal or voice gateway to request a specific IP QoS.

Regardless of which mechanism is used, a deeper problem remains. Quality of service is defined on an end-to-end basis, and therefore requires that sufficient network resources be provided throughout the network path. This is not an overwhelming issue for an enterprise network or a single Internet service provider (ISP) environment where all resources can be administered by one network manager. However, it is almost impossible to administer today when multiple ISPs are involved, as is the case in virtually every national or international long-distance connection. In addition, fulfilling the end-to-end QoS definition assumes that all routers in the network are equally capable of identifying voice traffic and providing the network resources required. This is still the exception rather than the rule in today's IP networks because standards for many of these mechanisms have not been finalized and implemented by equipment manufacturers.
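As one illustration of the ToS marking mentioned above, the sketch below shows how a sending application might mark its voice packets. It assumes the now-common reading of the ToS byte as a 6-bit differentiated services code point (DSCP) and uses the Expedited Forwarding value 46; the function names are hypothetical, and whether any router along the path honors the marking depends entirely on how that router is configured.

```python
import socket

# Assumption: voice is marked with the Expedited Forwarding code point (46).
EF_DSCP = 46


def tos_byte_for_dscp(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper six bits of the ToS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2


def open_marked_udp_socket(dscp: int = EF_DSCP) -> socket.socket:
    """Open a UDP socket whose outgoing IPv4 packets carry the given marking.

    IP_TOS support varies by operating system; treat this as a sketch.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_byte_for_dscp(dscp))
    return sock
```

Marking at the sender only labels the traffic. The end-to-end problem described above remains: every network in the path has to agree to act on the label.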

1.1.2 Speech codecs

A speech codec transforms analog voice into digital bitstreams, and vice versa. In addition, some speech codecs use compression techniques, removing redundant or less important information to reduce the transmission bandwidth required. Essentially, compression is a balancing act between voice quality, local computation power, delay, and network bandwidth required. The greater the bandwidth reduction, the higher the computational cost of the codec for a given level of perceived clarity. In addition, greater bandwidth savings generally cause higher computational delay and therefore significantly increase the end-to-end delay. The network planner must make an informed tradeoff between bandwidth, voice quality, and delay. Furthermore, low-bit-rate speech codecs such as G.729 and G.723.1 try to reproduce the subjective sound of the signal rather than the shape of the speech waveform. This means any lost or severely delayed information can have a much more noticeable effect on clarity than with a higher-bit-rate speech codec.

1.1.3 Silence suppression

A voice activity detector (VAD) is a speech gate. When the caller is talking, the VAD gate opens and voice packets are transmitted. When the caller is silent, the gate is closed and no packets are sent. Since human conversations are essentially half-duplex in the long term, the use of VAD can reduce bandwidth requirements by approximately 50 percent over an aggregation of channels. Figure 2 depicts the behavior of a VAD and its parameters.

Figure 2: Voice Activity Detector (VAD) behavior
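The gate behavior can be sketched in a few lines. The version below is a deliberately simple energy-threshold detector with a holdover period, not the algorithm of any particular gateway; the threshold, the frame layout, and the function names are assumptions made for illustration.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def vad_gate(frames, threshold, holdover_frames):
    """Return one boolean per frame: True means the frame is transmitted.

    The gate opens when frame energy reaches the threshold and stays open
    for `holdover_frames` further frames after the energy drops, so that
    the quiet tail of an utterance is not clipped.
    """
    decisions = []
    remaining = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            remaining = holdover_frames
            decisions.append(True)
        elif remaining > 0:
            remaining -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


# Three talk frames, five silent frames, two talk frames (made-up samples).
talk, silence = [1000] * 80, [0] * 80
frames = [talk] * 3 + [silence] * 5 + [talk] * 2
decisions = vad_gate(frames, threshold=100, holdover_frames=2)
fraction_sent = sum(decisions) / len(decisions)
```

In this toy run seven of the ten frames are sent. A longer holdover protects word endings at the cost of bandwidth savings, which is the tradeoff a VAD's parameters control.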

1.1.4 Comfort noise generation

A comfort noise generator (CNG) is a receive-side device, which is a complement to the transmit-side VAD. During periods of transmit silence, where no packets are sent, the receiver has a choice of what to present to the listener. Muting the channel (playing absolutely nothing) gives the listener the unpleasant impression that the channel has gone dead. A CNG at the receive side generates a local noise signal for presentation to the listener during transmit silence periods. The match between the generated noise and the actual background noise transmitted during the holdover time determines the quality of the CNG.

1.2 End-to-End Delay

Delay is the time required for a signal to traverse the network. In a telephony context, end-to-end delay is the time required for a signal generated at the talker’s mouth to reach the listener’s ear. The end-to-end delay is the sum of the delays at the different network devices and across the network links through which the voice traffic passes. Many factors contribute to end-to-end delay.

1.2.1 Telephony network

Telephony network delay is primarily determined by the transmission delay on long-distance trunks. The delay is especially high when satellite links are involved; a geostationary satellite link has a transmission delay of about 250 milliseconds (ms). In addition, there is switching delay in the network nodes, which is small when compared to the transmission delay. Telephone networks are usually already well tuned to low delay.
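The satellite figure can be checked from first principles. The sketch below assumes a geostationary altitude of 35,786 km and ground stations directly beneath the satellite, which gives the shortest possible path; real paths are slanted and therefore somewhat longer.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
GEO_ALTITUDE_KM = 35_786.0         # altitude of geostationary orbit


def propagation_delay_ms(distance_km: float) -> float:
    """One-way propagation time over the given distance, in milliseconds."""
    return distance_km / SPEED_OF_LIGHT_KM_S * 1000.0


# Ground station up to the satellite and back down to another ground station.
geo_hop_ms = propagation_delay_ms(2 * GEO_ALTITUDE_KM)
```

The result is a little under 240 ms for the ideal geometry, consistent with the figure of about 250 ms once slant range and equipment delay are included.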

1.2.2 IP network

IP network delay is primarily determined by the buffering, queuing, and switching or routing delay of the IP routers. Packet capture delay is the time required to receive the entire packet before processing and forwarding it through the router and is determined by the packet length and transmission speed. Using short packets over high-speed trunks can easily shorten the delay. Switching/routing delay is the time the router takes to switch the packet, as it analyzes the packet header, checks the routing table, and routes the packet to the output port. The amount of time is based on the architecture of the route engine and the size of the routing table. New IP switches can significantly speed up the routing process by making routing decisions and forwarding traffic in hardware instead of software components.

Furthermore, the statistical multiplexing nature of IP networks and the asynchronous nature of packet arrivals require queuing time at the input and output ports of a packet switch. This delay is a function of the traffic load on a packet switch, the length of the packets, and the statistical distribution over the ports. Over-provisioning router and link capacities can reduce but not completely eliminate the delay.
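The packet capture delay described above is simply packet length divided by link speed. A minimal sketch, with illustrative packet sizes and link rates that are not taken from the article:

```python
def packet_capture_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time to receive a whole packet from a link of the given bit rate."""
    return packet_bytes * 8 / link_bps * 1000.0


# A full-size 1500-byte data packet and a short 200-byte voice packet
# on a slow 64 kbit/s access link and on a 100 Mbit/s trunk.
slow_large = packet_capture_delay_ms(1500, 64_000)
slow_small = packet_capture_delay_ms(200, 64_000)
fast_large = packet_capture_delay_ms(1500, 100_000_000)
```

On the slow link the large packet takes 187.5 ms to arrive and the small one 25 ms; on the fast trunk even the large packet takes only 0.12 ms. This delay is incurred again at every store-and-forward hop.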

1.2.3 VoIP devices

VoIP gateways and VoIP terminals also contribute significantly to delay. Voice signal processing at the sending and the receiving sides adds to the delay. This is the time required to encode the voice signal, whether analog or already digital, into the chosen voice-coding scheme, and to decode it again at the far end. Some codecs also compress the voice signal, extracting redundancy, which further increases the delay because of the computation necessary. The higher the compression, the more voice bits need to be buffered; and the more complex the processing, the longer this delay component.

On the receive side, voice packets have to be delayed to compensate for variation in packet inter-arrival times. Packets generated with constant spacing over time (constant interval) will generally arrive at the receiver with spacing randomly distributed. The measure for this variance is called jitter and is due to the different buffering and queuing times packets experience in the IP network. Jitter smoothing is required because the speech codec requires a constant flow of data without gaps. This delay component can be reduced by designing a network with a lower delay jitter at each node and with as few nodes as possible. Using mechanisms that prioritize voice traffic over other traffic in the network can significantly reduce the jitter.
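One widely used way to quantify this variation is the running interarrival jitter estimate defined for RTP in RFC 3550. The article does not name a specific measure, so the choice of this estimator is an assumption; the sketch also works in milliseconds rather than RTP timestamp units.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate in the style of RFC 3550.

    For each pair of consecutive packets, D is the change in transit time.
    The estimate moves 1/16 of the way toward |D| on every packet.
    """
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times_ms, recv_times_ms):
        transit = received - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


# Packets sent every 20 ms; the network delay per packet is made up.
send = [0, 20, 40, 60]
steady = interarrival_jitter(send, [t + 50 for t in send])
varying = interarrival_jitter(send, [50, 80, 90, 130])
```

With a constant network delay the estimate stays at zero; with varying delay it grows. A receiver can size its smoothing buffer from such an estimate, trading added delay against the risk of gaps in playout.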

At the transmit side, packetization delay, or the time to fill a packet with data, is another factor. The larger the packet, the more time is required. Using shorter packets can shorten this delay; however, it increases the overhead because more packets have to be sent, all containing similar information in the header.
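The tradeoff between packetization delay and header overhead can be made concrete with a short calculation. The sketch assumes an uncompressed 64 kbit/s voice stream (G.711) carried over RTP/UDP/IPv4 with 40 bytes of headers per packet and ignores link-layer framing; none of these specifics come from the article.

```python
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP headers, no options


def packet_stats(codec_bps: float, packet_ms: float) -> dict:
    """Payload size, header overhead, and IP bandwidth for one voice stream.

    `packet_ms` is the amount of speech per packet, which is also the
    packetization delay.
    """
    payload_bytes = codec_bps / 8 * packet_ms / 1000
    total_bytes = payload_bytes + HEADER_BYTES
    packets_per_second = 1000 / packet_ms
    return {
        "payload_bytes": payload_bytes,
        "overhead_fraction": HEADER_BYTES / total_bytes,
        "ip_bandwidth_bps": total_bytes * 8 * packets_per_second,
    }


at_20_ms = packet_stats(64_000, 20)
at_10_ms = packet_stats(64_000, 10)
```

At 20 ms per packet, headers make up 20 percent of each packet and the stream needs 80 kbit/s. Halving the packetization delay to 10 ms raises the header share to one third and the bandwidth to 96 kbit/s.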

How much delay is too much? Delay does not affect intelligibility; rather, it affects the character of a conversation, up to the point where no conversation is possible at all. Below 100ms, most users will not notice the delay. Between 100ms and 300ms, users will notice a slight hesitation in the partner's response. This can affect how the partners perceive each other's mood, giving the impression of a rather "cold" conversation as more delay is experienced. Interruptions become more frequent and the conversation loses its rhythm. Beyond 300ms, the delay is obvious to the users, and they start to back off to prevent interruptions. The shorter the end-to-end delay, the better the perceived quality.
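Because end-to-end delay is the sum of the individual contributions, a planner can keep a simple delay budget and compare the total against the thresholds above. The component values below are hypothetical placeholders, not measurements, and the function names are invented for this sketch.

```python
def end_to_end_delay_ms(components: dict) -> float:
    """Sum the delay contributions along the voice path."""
    return sum(components.values())


def perceived_effect(delay_ms: float) -> str:
    """Map a one-way delay to the user perception bands described above."""
    if delay_ms < 100:
        return "not noticed by most users"
    if delay_ms <= 300:
        return "slight hesitation noticed"
    return "delay obvious, talkers back off"


# Hypothetical budget for one VoIP call, all values in milliseconds.
budget = {
    "codec processing": 25,
    "packetization": 20,
    "IP network (queuing, routing, transmission)": 60,
    "jitter buffer": 40,
}
total_ms = end_to_end_delay_ms(budget)
effect = perceived_effect(total_ms)
```

In this made-up budget the total of 145 ms already falls in the range where users notice a slight hesitation, which shows how quickly individually modest contributions add up.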

1.3 Echo

Echo is the reflection of a signal through the network with enough delay to be perceptible to the listener. Echo with a delay below about 28 ms is perceived as sidetone and is even desirable: it is reassuring for talkers to hear their own voices in their earpieces while talking. However, echo with a delay in excess of about 32 ms can be annoying to the speaker.

At the local telephone exchange, the two-wire local loop is connected to a four-wire trunk by a device called a hybrid. The hybrid separates the send- and receive-paths in order to carry them on separate pairs of wires. Because the separation of send and receive paths is not perfect, some of the receive signal leaks onto the send-path and generates an echo.

Another source of echo is the handset of a phone or the hands-free set of a phone or PC terminal. They can cause an acoustic echo, which is the result of poor voice coupling between the microphone and the loudspeaker or earpiece.

Echo cancelers are deployed to remove unwanted echo. An echo canceler monitors the far-end speech passing through its receive path and uses this information to compute an estimate of the echo, which is then subtracted from its send path. Echo cancelers are deployed as close as possible to the source of the echo: the local loop (also called the tail end) or the handset. They are therefore deployed in the local exchange, the VoIP gateway, or the VoIP PC terminal.

The quality of echo cancelers can be determined by comparing:

• Convergence time – time required for echo canceler to adapt to local tail circuit and provide adequate echo reduction;

• Cancellation depth – reduction in echo strength achieved, measured in dB; and

• Double-talk robustness – ability of echo canceler to continue working under conditions of simultaneous talking from both ends of the connection.
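The estimate-and-subtract principle, and the cancellation depth metric, can be illustrated with a small simulation. The article does not say which algorithm cancelers use; the sketch below assumes a normalized least-mean-squares (NLMS) adaptive filter, a common textbook choice, with a made-up echo path and white noise standing in for far-end speech. There is no near-end talker, so double-talk behavior is not modeled.

```python
import math
import random


def nlms_echo_canceler(far_end, send_path, taps=8, step=0.5, eps=1e-6):
    """Subtract an adaptively estimated echo of `far_end` from `send_path`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, send_path):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what remains on the send path
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def power_db(signal):
    """Mean power in dB, with a small floor to avoid log of zero."""
    return 10 * math.log10(sum(s * s for s in signal) / len(signal) + 1e-12)


def cancellation_depth_db(n=4000, seed=1):
    """Echo power minus residual power over the last quarter of the run."""
    rng = random.Random(seed)
    far_end = [rng.uniform(-1, 1) for _ in range(n)]
    echo_path = [0.0, 0.0, 0.5, 0.3, -0.1]  # hypothetical tail circuit
    echo = [sum(echo_path[k] * far_end[i - k]
                for k in range(len(echo_path)) if i - k >= 0)
            for i in range(n)]
    residual = nlms_echo_canceler(far_end, echo)
    tail = n // 4
    return power_db(echo[-tail:]) - power_db(residual[-tail:])
```

Measuring the depth only over the final quarter of the run gives the filter time to adapt first; how many samples that adaptation takes corresponds to the convergence time criterion. Real cancelers face noise, nonlinear tail circuits, and double-talk, so depths measured on this idealized model are far better than what deployed equipment achieves.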

1.4 Testing Voice Quality

One common, although not always useful, approach to testing voice quality is to use traditional telephony network techniques: comparing waveforms on a screen and making signal-to-noise ratio (SNR) and total harmonic distortion (THD) measurements. SNR, THD, and other linear measurements are useful only in certain cases because they assume that any change in the waveform indicates an unwanted network distortion. This is not the case, however, especially when low-bit-rate speech codecs such as G.729 and G.723.1 are used. Because low-bit-rate codecs try to reproduce the subjective sound of the signal rather than the shape of the speech waveform, different test methods are required for voice traffic traversing IP networks.

One method is the perceptual speech quality measurement technique (PSQM). The PSQM method, as defined by ITU-T Recommendation P.861, objectively analyzes speech with a bandwidth of 300-3400Hz. Simply put, PSQM is an automated human listener: it compares reference speech with test samples recorded after passing through a network. Applying auditory and cognitive models to calculate the perceptual degradation, PSQM obtains distortion scores that correspond strongly to the scores reported by human listeners. To test the quality of echo cancelers, ITU-T has defined methods in G.165 and G.168: objective metrics using white noise or pseudo-speech signals. Additional test tools are required that specifically address the challenges of carrying voice over IP networks. Comprehensive test solutions are actively being developed to accelerate the successful deployment of voice over IP.
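The limitation of waveform-based measurements noted earlier in this section can be shown with a small numerical example. The sketch computes a plain waveform SNR between a reference and a received signal; the test signal is a synthetic tone, and the polarity-inverted copy is chosen as an assumed example of a change that is generally inaudible in a single-channel signal yet alters every sample.

```python
import math


def snr_db(reference, received):
    """Waveform SNR: reference power over the power of the difference."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - x) ** 2 for r, x in zip(reference, received))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


# 400 Hz tone sampled at 8 kHz, and a polarity-inverted copy of it.
reference = [math.sin(2 * math.pi * 400 * n / 8000) for n in range(800)]
inverted = [-s for s in reference]

identical_snr = snr_db(reference, reference)
inverted_snr = snr_db(reference, inverted)
```

The inverted copy scores about -6 dB, which a waveform-based test would read as a badly degraded signal. Perceptual methods such as PSQM exist to avoid this kind of misjudgment.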

Stefan Pracht is the Product Marketing Manager for Voice and Fax over IP at Agilent Technologies (formerly Hewlett-Packard). Focusing on digital media communications and telephony test and analysis systems, Mr. Pracht has developed cable modem and IP analysis system strategies for a range of Agilent products. He helped develop Agilent’s VoIP business and is managing the marketing of its Telegra line of fax and voice test and analysis products. Mr. Pracht holds a Bachelor of Science in Telecommunications from the University of Dieburg, Germany. He can be contacted at 719-531-4524 or via email at stefan_pracht@.

-----------------------

Figure 1 - PSTN-Phone to VoIP-PC-terminal connection

[pic]
