


Implementation of Video Streaming Framework by RTP and DirectShow

S.O. FATEMI, M. HAJIBEGLOO, M.J. GHASEMI

Electrical and Computer Engineering Dept., University of Tehran, Tehran, Iran

University of Tehran, Institute of Electrotechnique, Campus No. 2, North Kargar Avenue, P.O. Box 14395/515, Tehran, I.R. of Iran

IRAN

Abstract: Multimedia streaming is becoming increasingly popular. Since the Internet was designed for computer data communication, satisfying the different characteristics and requirements of multimedia streams poses significant challenges. For effective and efficient Internet video streaming, many issues (e.g., multicast transmission, error control, synchronization, etc.) must be addressed. In this paper, a framework for Internet video streaming is proposed. We have implemented the RTP protocol for streaming video and propose a scheme for embedding video information into RTP packets. The DirectShow architecture, Microsoft's architecture for capturing and presenting multimedia data, is selected as our base framework.

Key-Words: RTP, RTCP, Video Streaming, Video Packet, Packetization, DirectShow.

1 Introduction

Recently, a number of protocols have been proposed to provide proper support for multimedia delivery over IP networks. The Real-Time Transport Protocol (RTP) [1] provides end-to-end network transport functions suitable for applications transmitting real-time data, such as audio, video or simulation data, over multicast or unicast network services. The services provided by RTP include payload type identification, sequence numbering, timestamping and delivery monitoring. RTP typically runs on top of the User Datagram Protocol (UDP) to make use of its multiplexing service.
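
The fixed RTP header that carries these services is only 12 bytes. The following is a minimal C++ sketch of its layout as defined in RFC 1889 [1]; the struct and helper names are our own, and a real implementation should serialize the fields explicitly rather than rely on compiler-dependent packing.

// Sketch of the 12-byte fixed RTP header defined in RFC 1889 (network byte order).
#include <cstdint>

struct RtpHeader {
    uint8_t  vpxcc;      // version (2 bits), padding (1), extension (1), CSRC count (4)
    uint8_t  mpt;        // marker (1 bit), payload type (7 bits)
    uint16_t seq;        // sequence number, incremented by one per packet
    uint32_t timestamp;  // sampling instant of the first byte in the payload
    uint32_t ssrc;       // synchronization source identifier
    // an optional CSRC list (0..15 entries) follows the fixed header
};

// Helpers for the packed fields.
inline uint8_t RtpVersion(const RtpHeader& h)     { return h.vpxcc >> 6; }
inline uint8_t RtpPayloadType(const RtpHeader& h) { return h.mpt & 0x7F; }
inline bool    RtpMarker(const RtpHeader& h)      { return (h.mpt & 0x80) != 0; }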

RTP does not address resource reservation and does not guarantee quality-of-service for real-time services. The data transport is augmented by a control protocol (RTCP) to allow monitoring of the data delivery in a manner scalable to large multicast networks, and to provide minimal control and identification functionality. RTP and RTCP are designed to be independent of the underlying transport and network layers.

The paper is organized as follows: Section 2 reviews the RTP protocol. Section 3 presents the DirectShow architecture. Section 4 presents DirectShow RTP, our proposed streaming framework. Section 5 describes our packetization scheme for video frames, followed by our simulation results and conclusions.

2 Real Time Transport Protocol

The Real-Time Transport Protocol (RTP) is an IP-based protocol providing support for the transport of real-time data such as audio and video streams. RTP is primarily designed for multicast of real-time data, but it can also be used in unicast. RTP went through several states of Internet Drafts and was finally approved as a Proposed Standard by the IESG on November 22, 1995.

Multimedia applications require appropriate timing in data transmission and playback. RTP provides timestamping and sequence numbering to address these timing requirements. Through these mechanisms, RTP provides end-to-end transport for real-time data over networks.

Sequence numbers are used to place the incoming data packets in the correct order. They are also used for packet loss detection. Timestamping is the most important information for real-time applications. The sender sets the timestamp according to the instant at which the first byte in the packet was sampled. Timestamps increase by the sampling time covered by each packet. After receiving the data packets, the receiver uses the timestamps to reconstruct the original timing in order to play out the data packets at the correct rate. Timestamps are also used to synchronize different streams with timing properties, such as the audio and video streams of MPEG. However, RTP itself is not responsible for synchronization; this has to be done at the application level.
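
As an illustration of how a receiver can use the timestamp, the sketch below maps RTP timestamps to local playout times, assuming the 90 kHz clock commonly used for video payloads and a fixed de-jitter delay; the class and its members are purely illustrative and are not part of RTP itself.

// Minimal sketch of mapping RTP timestamps to local playout times,
// assuming a 90 kHz video clock and a fixed playout (de-jitter) delay.
#include <cstdint>

class PlayoutClock {
public:
    // Call once for the first received packet to anchor RTP time to local time.
    void Anchor(uint32_t rtpTimestamp, double localSeconds) {
        baseRtp_   = rtpTimestamp;
        baseLocal_ = localSeconds;
    }
    // Local time at which a sample with this timestamp should be rendered.
    double PlayoutTime(uint32_t rtpTimestamp) const {
        // 32-bit unsigned subtraction handles timestamp wraparound.
        uint32_t delta = rtpTimestamp - baseRtp_;
        return baseLocal_ + static_cast<double>(delta) / 90000.0 + playoutDelay_;
    }
private:
    uint32_t baseRtp_      = 0;
    double   baseLocal_    = 0.0;
    double   playoutDelay_ = 0.2;  // fixed 200 ms de-jitter buffer (illustrative)
};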

3 DirectShow architecture

DirectShow [2] is designed to provide the underlying multimedia streaming functionality for many applications. It is used for all aspects of multimedia computing including capture, encode/decode, playback, and storage of audio and video data. DirectShow introduces four primary abstractions for manipulating multimedia data. These abstractions are termed filters, pins, media samples, and media types.

DirectShow filters are used to encapsulate one or more tasks related to multimedia streaming. Examples of DirectShow filters include the video capture filter, which is used to control a video camera device, outputting raw RGB video frames, and the H.261 codec filter, which is used to compress raw RGB video buffers into H.261 frames. Filters are also provided for rendering audio and video to local devices. In DirectShow, an application performs any task by connecting chains of filters together, so that the output from one filter becomes the input for another. A set of connected filters is called a filter graph. For example, the following diagram shows a filter graph for playing an AVI file.


Figure 1: A DirectShow filter graph

In Figure 1, the File Source filter reads the AVI file from the hard disk. The AVI Splitter filter parses the file into two streams, a compressed video stream and an audio stream. The AVI Decompressor filter decodes the video frames. The Video Renderer filter draws the frames to the display. The Default DirectSound Device filter plays the audio stream.
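
As a sketch of how an application would build and run the graph of Figure 1, the fragment below uses the filter graph manager's IGraphBuilder::RenderFile method, which inserts and connects the required filters automatically; the file name is a placeholder and error handling is kept minimal.

// Minimal sketch of building and running the AVI playback graph of Figure 1.
#include <dshow.h>
#pragma comment(lib, "strmiids.lib")

HRESULT PlayAviFile()
{
    CoInitialize(NULL);

    IGraphBuilder* pGraph  = NULL;
    IMediaControl* pControl = NULL;
    IMediaEvent*   pEvent   = NULL;

    // Create the filter graph manager.
    HRESULT hr = CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                                  IID_IGraphBuilder, (void**)&pGraph);
    if (FAILED(hr)) { CoUninitialize(); return hr; }

    // RenderFile inserts and connects the source, splitter, decoder and
    // renderer filters shown in Figure 1 automatically.
    hr = pGraph->RenderFile(L"example.avi", NULL);   // placeholder file name
    if (SUCCEEDED(hr)) {
        pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);
        pGraph->QueryInterface(IID_IMediaEvent, (void**)&pEvent);
        pControl->Run();                                  // start streaming
        long evCode = 0;
        pEvent->WaitForCompletion(INFINITE, &evCode);     // block until playback ends
        pControl->Stop();
        pControl->Release();
        pEvent->Release();
    }
    pGraph->Release();
    CoUninitialize();
    return hr;
}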

DirectShow filters can be grouped into several categories. A source filter introduces data into the graph; the data might come from a file, a network, a camera, or anywhere else. A transform filter takes an input stream, processes the data, and creates an output stream; encoders and decoders are examples of transform filters. Renderer filters sit at the end of the chain: they receive data and present it to the user. For example, a video renderer draws video frames on the display, an audio renderer sends audio data to the sound card, and a file-writer filter writes data to a file.

Filters are connected together via pins. Pins are responsible for negotiating the media type (see below) and the memory allocator used for the filter interconnection. Media type negotiation is the means of determining the media type that will govern the format of data exchanged between two filters. Memory allocators are used to specify where the memory used to contain multimedia buffers (also called media samples) will be allocated and what the characteristics of that memory will be (e.g., byte alignment, use of special regions of memory from memory-mapped devices, etc.). Once a connection has been successfully negotiated, a filter receives and delivers media samples through its pins, which in turn implement the actual means whereby the samples are delivered to the next filter in a filter graph.

The media type is a universal and extensible way to describe digital media formats. When two filters connect, they agree on a media type. The media type identifies what kind of data the upstream filter will deliver to the downstream filter, and the physical layout of the data. If two filters cannot agree on a media type, they will not connect. In addition to the multimedia data they contain, media samples also carry start and end time stamps, which specify the lifetime of the sample. These values are used by renderers to determine when a sample should be played (i.e., rendered) and to detect performance problems. DirectShow media types specify the format of the data contained in media samples. Media types include several fields, the most important of which are the major and minor type fields. A major type is typically used to differentiate formats according to high-level semantic guidelines; MAJORTYPE_VIDEO and MAJORTYPE_AUDIO are two examples of major types. Minor types typically specify encoding differences; examples include MINORTYPE_AUDIO_G711A and MINORTYPE_AUDIO_G723. If the pins of two filters can find a common media type during negotiation, then a connection is possible. DirectShow allows the definition of new filters, pins, and media types. Taking advantage of this built-in extensibility, we create a new framework for streaming video called DirectShow RTP.
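
As a minimal sketch of the negotiation step, the function below shows how a filter might accept or reject a proposed media type by examining its major and minor type fields; in the DirectShow API these correspond to the majortype and subtype GUIDs of the AM_MEDIA_TYPE structure, and the expected subtype is whatever encoding the filter supports (the function name is our own).

// Sketch of the media type check a filter performs during pin connection.
#include <dshow.h>

HRESULT CheckVideoInputType(const AM_MEDIA_TYPE* mt, REFGUID expectedSubtype)
{
    if (mt == NULL) return E_POINTER;
    if (mt->majortype != MEDIATYPE_Video)   // high-level class of the data
        return VFW_E_TYPE_NOT_ACCEPTED;
    if (mt->subtype != expectedSubtype)     // specific encoding, e.g. H.261 or H.263
        return VFW_E_TYPE_NOT_ACCEPTED;
    return S_OK;                            // a connection is possible
}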

4 DirectShow RTP

We propose DirectShow RTP as our framework for video streaming applications. DirectShow RTP defines a set of filters and two new media types for DirectShow that provide support for network multimedia streaming using the RTP protocol.

We implemented four filters: RTP Source, RTP Render, Send Payload Handler (SPH), and Receive Payload Handler (RPH). We also define RTP_MAJORTYPE as our major media type and RTP_MINORTYPE as our minor media type.
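
A minimal sketch of how the two new media types could be declared is shown below; the GUID values are placeholders (they would be generated once and kept fixed), and the identifier names simply mirror DirectShow's MEDIATYPE_/MEDIASUBTYPE_ convention for what we call RTP_MAJORTYPE and RTP_MINORTYPE.

// Sketch of declaring the two new media types used by DirectShow RTP.
#include <initguid.h>
#include <dshow.h>

// Major type: the stream consists of RTP packets rather than raw media.
// Placeholder GUID value.
DEFINE_GUID(MEDIATYPE_RTP,
    0xb7569e41, 0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01);

// Minor type: generic RTP packet stream (payload type carried in each packet).
// Placeholder GUID value.
DEFINE_GUID(MEDIASUBTYPE_RTP_Packets,
    0xb7569e42, 0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02);

// Fill an AM_MEDIA_TYPE as advertised by the RTP Source filter's output pin
// and accepted by the RTP Render filter's input pin.
inline void InitRtpMediaType(AM_MEDIA_TYPE* mt)
{
    ZeroMemory(mt, sizeof(*mt));
    mt->majortype = MEDIATYPE_RTP;            // the paper's RTP_MAJORTYPE
    mt->subtype   = MEDIASUBTYPE_RTP_Packets; // the paper's RTP_MINORTYPE
    mt->bFixedSizeSamples = FALSE;            // RTP packets vary in size
}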

The RTP Source filter is implemented to receive RTP and RTCP packets from a single RTP session. These packets are encapsulated in media samples and delivered into a filter graph. The media type advertised by this filter is of major type RTP_MAJORTYPE and minor type RTP_MINORTYPE. This filter provides interfaces for specifying the network address and port number to use for receiving an RTP session.

The RTP Render filter is designed for sending RTP and RTCP packets to a single RTP session. This filter accepts incoming connections of major type RTP_MAJORTYPE and minor type RTP_MINORTYPE. It provides control interfaces similar to those found on the RTP Source filter.

The RTP Receive Payload Handler (RPH) filter is used to transform RTP packets from a single source of a fixed payload type into their corresponding unpacketized (native) form. Thus, one version of this filter takes RTP H.263 packets and produces H.263 compressed video frames. Versions of this filter have been written for many popular payload types, including H.261, H.263, Indeo, G.711, and G.723. Similar to the RTP RPH filter is the RTP Send Payload Handler (SPH) filter, which is responsible for segmenting media samples produced by video or audio codec filters into RTP packets. This filter provides interfaces for specifying the maximum size of the packets to produce (in order to allow for differing network MTUs) and for specifying the value to place in the PT field (to allow the use of dynamic RTP PT values).
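
The exact form of these control interfaces is not given here; the following hypothetical COM interface sketches what the SPH filter's control interface could look like, with a placeholder IID and method names of our own choosing.

// Hypothetical sketch of a control interface exposed by the RTP SPH filter
// for setting the maximum packet size (to respect the network MTU) and the
// RTP payload type value.  Interface name, IID and methods are illustrative.
#include <objbase.h>

interface __declspec(uuid("1D36CB10-0000-0000-0000-0000000000AA"))
IRTPSPHControl : public IUnknown
{
    // Largest RTP packet (header + payload) the filter may produce, in bytes.
    virtual HRESULT STDMETHODCALLTYPE SetMaxPacketSize(ULONG cbBytes) = 0;

    // Value to place in the 7-bit PT field, allowing dynamic payload types.
    virtual HRESULT STDMETHODCALLTYPE SetPayloadType(BYTE payloadType) = 0;
};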


Figure 2: Sending video with DirectShow RTP

Figures 2 and 3 demonstrate how the filters defined in DirectShow RTP are used. Figure 2 shows a filter graph used to capture video data and send it across a computer network using RTP. This filter graph consists of a video capture filter, which outputs raw video frames, followed by a codec filter, which compresses the frames. The compressed frames are delivered to the RTP SPH filter, which fragments them into RTP packets; these packets are in turn delivered to the RTP Render filter, which transmits them across the network.
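
A sketch of assembling such a sending graph with the filter graph manager is shown below; the capture, codec, SPH and RTP Render filter instances are assumed to have been created elsewhere (device enumeration and filter CLSIDs are omitted), and the pin-finding helper is our own simplification.

// Sketch of assembling the sending graph of Figure 2.
#include <dshow.h>

// Find the first pin of the requested direction on a filter.
static HRESULT GetPin(IBaseFilter* pFilter, PIN_DIRECTION dir, IPin** ppPin)
{
    IEnumPins* pEnum = NULL;
    HRESULT hr = pFilter->EnumPins(&pEnum);
    if (FAILED(hr)) return hr;
    IPin* pPin = NULL;
    while (pEnum->Next(1, &pPin, NULL) == S_OK) {
        PIN_DIRECTION d;
        pPin->QueryDirection(&d);
        if (d == dir) { *ppPin = pPin; pEnum->Release(); return S_OK; }
        pPin->Release();
    }
    pEnum->Release();
    return E_FAIL;
}

// Connect the first output pin of pSrc to the first input pin of pDst,
// letting the graph manager negotiate the media type.
static HRESULT ConnectFilters(IGraphBuilder* pGraph, IBaseFilter* pSrc, IBaseFilter* pDst)
{
    IPin *pOut = NULL, *pIn = NULL;
    HRESULT hr = GetPin(pSrc, PINDIR_OUTPUT, &pOut);
    if (SUCCEEDED(hr)) hr = GetPin(pDst, PINDIR_INPUT, &pIn);
    if (SUCCEEDED(hr)) hr = pGraph->Connect(pOut, pIn);
    if (pOut) pOut->Release();
    if (pIn)  pIn->Release();
    return hr;
}

HRESULT BuildSendGraph(IGraphBuilder* pGraph,
                       IBaseFilter* pCapture,  // video capture filter
                       IBaseFilter* pCodec,    // e.g. H.263 encoder filter
                       IBaseFilter* pSPH,      // RTP Send Payload Handler
                       IBaseFilter* pRender)   // RTP Render (network send)
{
    HRESULT hr;
    if (FAILED(hr = pGraph->AddFilter(pCapture, L"Capture")))     return hr;
    if (FAILED(hr = pGraph->AddFilter(pCodec,   L"Encoder")))     return hr;
    if (FAILED(hr = pGraph->AddFilter(pSPH,     L"RTP SPH")))     return hr;
    if (FAILED(hr = pGraph->AddFilter(pRender,  L"RTP Render")))  return hr;

    // capture -> encoder -> SPH -> RTP Render
    if (FAILED(hr = ConnectFilters(pGraph, pCapture, pCodec)))    return hr;
    if (FAILED(hr = ConnectFilters(pGraph, pCodec,   pSPH)))      return hr;
    if (FAILED(hr = ConnectFilters(pGraph, pSPH,     pRender)))   return hr;
    return S_OK;
}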


Figure 3: Receiving video with DirectShow RTP

Figure 3 shows a filter graph used to receive RTP packets containing a video stream and play back the stream. This graph consists of an RTP Source filter, which receives the packets, and an RTP RPH filter, which converts the RTP packets into compressed video frames. These are followed by a decoder filter, which decompresses the frames, and a video renderer filter, which displays the uncompressed frames.

5 Packetization

A typical “video packet” consists of header information for IP, UDP and RTP, the payload-specific header, and the video payload. The sizes of these headers are 20 bytes for the IP header, 8 bytes for the UDP header, 12 bytes for the RTP header, and a variable number of bytes for the payload-specific header. Thus, the minimum amount of header information for each packet is 40 bytes. Video packets therefore need to be as large as possible to achieve a reasonable ratio between header information and payload. Two upper bounds on the packet size have to be considered. First, transmitting more than one frame in a single packet is not possible due to delay constraints, and second, the typical Maximum Transfer Unit (MTU) size of the Internet has to be respected, so packets should not exceed 1500 bytes. On Ethernet, every IP packet bigger than 1500 bytes has to be split into at least two Ethernet packets, thus at least doubling the IP packet loss rate for a given Ethernet packet loss rate. Although Ethernet is not relevant for long-distance Internet connections, many router implementations still seem to use split/recombine algorithms for packet sizes larger than the MTU size [5].

Thus, the maximum packet payload size is 1460 bytes, or 11,680 bits. Many coded frames will fit completely into one packet. For example, a raw QCIF (176×144) frame is 25,344 bytes and can fit into one RTP packet after roughly 18:1 compression; at 10 frames per second, a stream of such packets corresponds to about 116,800 bit/s. Larger frames, such as CIF (352×288), are possible when using higher compression. These numbers suggest a general rule for minimum-overhead packetization that can be expressed as “one frame, one packet” [4].

However, when applying this rule, a single packet loss means the loss of an entire encoded frame. From an error-resilience point of view, it would be more desirable to divide an encoded frame into a larger number of packets to keep the spatial area affected by a packet loss as small as possible; the corresponding rule is to pack every GOB (group of blocks) into one RTP packet.

We propose a more network-friendly packetization, in which the described overhead is minimized by using a smaller number of (bigger) packets. In this scheme, more than one GOB is packed into a single packet, while the maximum packet size is still limited to one frame. For error resilience, it is preferable not to pack GOBs into one packet in the same order in which they are encoded. We propose a simple interleaving scheme that packs all even GOBs into one packet and all odd GOBs into another. This leads to two packets per frame and allows the concealment of all macroblocks of a frame if only one of those two packets is lost. The frame header contains information relevant to the whole frame and appears only once per encoded frame, at its very start. If this frame header is carried in only one packet and that packet is lost, critical information is lost and the decoder cannot decode the whole frame. We therefore add a redundant copy of the frame header to the payload header of each packet. This mechanism ensures that a frame can be (partially) decoded and concealed even when the first packet of the frame is lost.
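
The following sketch illustrates the proposed interleaving; the data structures are our own simplification, and in the real system the payload header follows the H.263+ RTP payload format of RFC 2429 [3].

// Sketch of the proposed packetization: even-numbered GOBs go into one RTP
// packet, odd-numbered GOBs into another, and each packet carries a redundant
// copy of the picture (frame) header so the frame can still be partially
// decoded and concealed if one packet is lost.
#include <cstdint>
#include <vector>

struct EncodedGob {
    int                  number;  // GOB number within the frame (0..8 for QCIF)
    std::vector<uint8_t> bits;    // encoded GOB data
};

struct EncodedFrame {
    std::vector<uint8_t>    pictureHeader;  // frame-level header, normally sent once
    std::vector<EncodedGob> gobs;
};

// Builds the two packet payloads for one frame.  RTP/UDP/IP headers are added
// by the SPH filter and the network stack and are not shown here.
static void PacketizeFrame(const EncodedFrame& frame,
                           std::vector<uint8_t>& evenPacket,
                           std::vector<uint8_t>& oddPacket)
{
    // Redundant picture header at the start of both payloads.
    evenPacket = frame.pictureHeader;
    oddPacket  = frame.pictureHeader;

    for (const EncodedGob& gob : frame.gobs) {
        std::vector<uint8_t>& dst = (gob.number % 2 == 0) ? evenPacket : oddPacket;
        dst.insert(dst.end(), gob.bits.begin(), gob.bits.end());
    }
    // With QCIF at the bit rates considered, each payload stays well below the
    // 1460-byte limit derived from the 1500-byte Ethernet MTU.
}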

The constant packetization overhead (consisting of the IP/UDP/RTP headers, 40 bytes per packet in total) is thus 80 bytes per frame.

6 Simulation results

An H.263 video coder [4] is used in our simulations. Simulation results are obtained using the Foreman sequence in QCIF resolution. Three bit rates are considered: 20 kbps and 50 kbps to simulate modem and ISDN dial-up connections to an ISP, and 150 kbps to simulate high-bandwidth connections to the Internet backbone or LAN connections. All bit rates include network and video packetization overhead. By splitting the video frames into GOBs, additional coding penalties are incurred both from the size of the GOB headers themselves and from predictive coding limitations (e.g., motion vector prediction). This has to be taken into account when comparing the PSNR values to those obtained by a coder optimized for a lossless environment. Table 1 provides the packetization overheads and the resulting PSNR.

Next, we compare the one-frame-two-packets scheme to the one-GOB-one-packet approach. Figure 4 shows the PSNR of the different packetization schemes versus packet loss rate (PLR) for 50 and 150 kbps. Note that the bit rates include the packetization overhead.

The one-GOB-one-packet scheme requires an overhead of 28.8 kbps at QCIF resolution and 10 fps, which limits the lowest video bit rate reasonably achievable with this packetization method. Thus, for low bit rates this method cannot be used. However, even at high bit rates, the proposed packetization works consistently better than the one-GOB-one-packet scheme.

7 Conclusions

In this paper, we have proposed a framework for Internet video streaming. A scheme for packetizing video frames into RTP packets has also been proposed. It was shown that our approach yields good reconstructed video quality even at high packet loss rates, such as 20%, while keeping the packetization overhead to a minimum.

| Connection (transport bit rate, kbps) | Video bit rate (kbps) | Effective video bit rate (kbps) | PSNR at 0% PLR (dB) | PSNR at 20% PLR (dB) |
| Modem (33.0)   | 20.0  | (1) 13.6  | 27.1 | 20.9 |
|                |       | (2) N/A   | N/A  | N/A  |
| ISDN (64.0)    | 50.0  | (1) 43.6  | 30.0 | 23.6 |
|                |       | (2) 21.2  | 28.1 | 20.7 |
| LAN (>150.0)   | 150.0 | (1) 143.6 | 34.1 | 27.6 |
|                |       | (2) 121.2 | 33.7 | 25.1 |

Table 1: Example of possible PSNR degradation due to packetization methods at QCIF resolution and 10 fps: (1) using two packets per picture and (2) using one packet per GOB, for the Foreman sequence.


Figure 4: Relative performance of the two packetization schemes for the Foreman sequence at different bit rates and PLRs (packet loss rates).

References

[1] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications," RFC 1889, January 1996.

[2] Microsoft Corporation, DirectShow Online Documentation.

[3] C. Bormann, L. Cline, G. Deisher, T. Gardos, C. Maciocco, et al., "RTP Payload Format for the 1998 Version of ITU-T Rec. H.263 Video (H.263+)," RFC 2429, May 1998.

[4] ITU-T Rec. H.263, Version 2, "Video Coding for Low Bit Rate Communication," January 1998.

[5] M. Handley, "An Examination of MBone Performance," USC/ISI Research Report, January 1997.
