VLSI Design of a 4-port Crossbar Switch with priority

Final Report

Abid Ali (abid@cs.wisc.edu), Ross Dickson (dickson@cs.wisc.edu), Harit Modi (harit@cs.wisc.edu)

We describe the design of a 4-port crossbar switch based on virtual-output queueing with source routing and dual-priority packets. The virtual queues are implemented using a shared buffer and a set of eight linked lists for each input port. Each linked list is implemented as a head and tail pointer stored in a table, plus a next pointer associated with each element in the data buffer. The scheduler, which decides which packet to send next, is based on a variation of the 2-dimensional round-robin (2DRR) diagonal routing algorithm. Our router uses a 2-stage pipeline with a biphase clock.

Section 1. Introduction

We describe the design and initial testing of a four-port network switch. Our switch is based on a simple crossbar with virtual-output queueing. We use a simple source-routed packet format that supports two priority levels. Packets are buffered in virtual-output queues at each input port. The switch is governed by a scheduler, which routes packets through the switch based on a variation of the 2-dimensional round-robin (2DRR) diagonal routing algorithm.

Section 2. System Overview

Our goal was to create a chip which can receive packets on four input ports and send packets on four output ports every cycle. Routing from input to output ports is based on an output number found in the packet. Packets have either high or low priority. In any cycle, at most one packet can be sent from each input port and at most one packet can be sent to each output port. Packets which fail a parity check are dropped.

Our chip's interface to the world is quite simple: four input ports, four output ports, clock, reset, power, and ground. Each data port is 13 bits wide, supporting a simple single-flit packet plus a data-valid signal. The packet consists of a 4-bit payload, 6 routing bits, a priority bit, and a parity bit. The extra signal line is used to specify whether the other pins contain valid data. The clock signal runs at twice the data rate on the data ports. Input ports latch data on even-numbered rising edges of the clock, while outputs are available for latching on odd-numbered rising edges. When the switch is uncontested, a packet read on one rising edge can be made available on an output port on the following rising edge.
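
For concreteness, the 13-bit port word could be modeled as below. The report gives only the field widths, so the bit ordering and the names (port_word_t, pack_port_word) are our own illustrative assumptions, not the actual pin assignment.

    #include <stdint.h>

    /* One 13-bit port word, modeled in a 16-bit integer.
     * Bit layout is an assumption for illustration:
     *   [12]   valid    - data-valid signal
     *   [11]   parity   - parity bit
     *   [10]   priority - 1 = high priority, 0 = low
     *   [9:4]  route    - 6 source-routing bits
     *   [3:0]  payload  - 4-bit payload
     */
    typedef struct {
        uint8_t valid;
        uint8_t parity;
        uint8_t priority;
        uint8_t route;    /* 6 bits used */
        uint8_t payload;  /* 4 bits used */
    } port_word_t;

    static uint16_t pack_port_word(port_word_t w)
    {
        return (uint16_t)(((w.valid    & 1u)   << 12) |
                          ((w.parity   & 1u)   << 11) |
                          ((w.priority & 1u)   << 10) |
                          ((w.route    & 0x3Fu) << 4) |
                           (w.payload  & 0xFu));
    }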

Section 3. Switch Components

The major parts of the switch are four input ports, a scheduler, and a crossbar connecting to four output ports. To simplify implementation, our switch uses a 2-stage pipeline, which allows incoming packets and outgoing packets to be processed concurrently. A block diagram of the switch is shown in Figure 1. Each input port consists of a Packet Buffer and a Head-Tail Table. The Head-Tail Table allows the packet buffer to function as eight linked lists to support virtual-output queueing.

Figure 1: Block diagram of the switch. Four input ports, each containing a Head-Tail Table and a Packet Buffer, feed the Packet Scheduler (Diagonal Selector) and the crossbar, which drives the four outputs.

The Head-Tail Table consists of a 4-bit head pointer, a 4-bit tail pointer, and a valid bit; it contains a table of 8 elements. The Packet Buffer stores the packets which have not yet been sent out, a pointer to the next element in the queue, and a valid bit which is also used as a free list to decide where to insert the next element. The Packet Scheduler selects which diagonal is to be sent out and informs the input ports and the crossbar. The crossbar takes an input from each of the four input ports and routes it to the correct output depending on the diagonal selected.
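
The diagonal selection can be sketched behaviorally in a few lines. We assume here that diagonal d pairs input i with output (i + d) mod 4, the usual generalized-diagonal construction in 2DRR; this is an illustrative sketch, not our exact pattern sequence.

    #define NPORTS 4

    /* Grant one input->output pair per row of the selected diagonal.
     * request[i][j] is nonzero if input i has a queued packet for output j.
     * Assumption: diagonal d pairs input i with output (i + d) % NPORTS.
     */
    static void grant_diagonal(const int request[NPORTS][NPORTS],
                               int diagonal,
                               int grant_out[NPORTS]) /* -1 = no grant */
    {
        for (int i = 0; i < NPORTS; i++) {
            int j = (i + diagonal) % NPORTS;
            grant_out[i] = request[i][j] ? j : -1;
        }
    }

Because each diagonal is a permutation of the ports, this matching automatically satisfies the constraint that at most one packet leaves each input and at most one packet arrives at each output per cycle.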

Linked List Cell: The chief purpose of this cell is to maintain the linked list associated with each of the eight queues (4 output ports x 2 priorities) for each input port. In addition to keeping the head, tail, and valid bits of each linked list, this cell incorporates the Free List, a priority encoder, and a few latches. The Free List generates the address of the next free entry in the data buffer; this address is used by the data buffer to insert the packet at the right spot and also by the Linked List to update the head and tail pointers. The free list is essentially a bit-vector which keeps track of which elements of the data buffer are valid and which slots are free. The priority encoder takes the free list as input and generates as output an encoding of the first free element.
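
Behaviorally, the free list and priority encoder amount to the following sketch (the function name is ours, and the 16-slot width matches the 16-entry data buffer described below):

    #include <stdint.h>

    #define BUF_SLOTS 16

    /* valid_bits has one bit per data-buffer slot (1 = occupied).
     * Returns the index of the first free slot, mirroring the priority
     * encoder described above, or -1 if the buffer is full.
     */
    static int first_free_slot(uint16_t valid_bits)
    {
        for (int i = 0; i < BUF_SLOTS; i++) {
            if (!((valid_bits >> i) & 1u))
                return i;   /* lowest-numbered free entry wins */
        }
        return -1;          /* buffer full: no entry can be inserted */
    }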

The Linked List is further divided into three parts: the Head cell, the Tail cell, and the Valid cell. These consist mainly of D flip-flops, and their timing is described in one of the following sections. We implemented them in standard cells. We do have a very compact custom layout of a D flip-flop, which would have saved considerable space given the number of DFFs we use, but we decided not to spend time building the surrounding control logic around the DFFs by hand. Since we cannot merge custom cells with standard cells, we revised our implementation plans so that the design does not include our custom DFFs.

Data Buffer: The data buffer holds the actual packet data. We have two implementations of the data buffer: a register file implementation built from standard cells, and an SRAM version implemented in full-custom dynamic logic.

1. Register file implementation: The register file version of the data buffer contains two decoders, 16 registers of 10 bits each (we hardcoded two bits of the data into our crossbar), and one 10-bit 16-to-1 MUX. The decoder for writing into the register file has an enable signal, which is asserted only if the packet is valid, there is space in the buffer for more packets, and the packet has passed parity checking.
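
The enable condition is an AND of three terms; a behavioral sketch follows. The even-parity convention here is an assumption, since the report does not state the parity sense.

    #include <stdint.h>
    #include <stdbool.h>

    /* Even-parity check over the 12 packet bits (payload, routing,
     * priority, and parity bits together). Even vs. odd parity is an
     * assumption for illustration.
     */
    static bool parity_ok(uint16_t packet12)
    {
        int ones = 0;
        for (int i = 0; i < 12; i++)
            ones += (packet12 >> i) & 1u;
        return (ones % 2) == 0;
    }

    /* Write-port enable for the register file: assert only when the
     * word is valid, the buffer has a free slot, and parity passes.
     */
    static bool write_enable(bool valid, bool has_space, uint16_t packet12)
    {
        return valid && has_space && parity_ok(packet12);
    }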

2. Custom-built SRAM version: The SRAM version of the data buffer is laid out very compactly. The SRAM has one read port and one write port. The cell was carefully designed so that it can be butted against other cells on each of its edges. Full-custom decoders and precharge cells have been implemented. Before committing to the full-custom layout, the SRAM was tested and verified at the transistor level using Accusim. The SRAM and the decoders are dynamic and precharged.

Next Pointer Table: This contains the next pointer associated with each data element to form the linked list. Similar to the data buffer, we have two implementations of this cell: a standard-cell register file version and a custom-built SRAM version.

1. Register File implementation: This is much like the data buffer, with the only difference being that two writes may occur simultaneously when forming the linked lists. Since the SRAM cell was designed with one write port, special decoder logic is used to enable two select lines at the same time. For example, consider the scenario in which a packet is already present in data_buffer[3], so Next_buffer[3] = 3. Note that the same value is held in the head and the tail:

Index    Next Buffer Value
3        3

Head Value    Tail Value    NewPtr for new packet
3             3             X

Now assume that for some reason the packet stayed back in the queue and was not sent forward, and a new packet arrives destined for the same queue. The NewPtr value is 6 right now. The following would happen:

Index    Next Buffer Value
3        6
6        6

Head Value    Tail Value    NewPtr for new packet
3             6             6

It can be seen that two writes happened in the Next buffer simultaneously with only one write port; both locations are written with the same value (6), which is why enabling two select lines on a single write port suffices. This is done to simplify the logic, and the tail is always kept as a self-referential pointer. This simplifies the logic for resetting the Valid cell (1 -> 0).
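
A behavioral model of the enqueue operation makes the self-referential-tail trick explicit (the names are ours, and this is a sketch rather than the actual hardware):

    #define QUEUES 8    /* 4 outputs x 2 priorities per input port */
    #define SLOTS  16

    typedef struct {
        int head[QUEUES];
        int tail[QUEUES];
        int valid[QUEUES];    /* queue non-empty? */
        int next[SLOTS];      /* next-pointer table */
    } port_queues_t;

    /* Enqueue buffer slot `slot` onto queue `q`. Both writes to next[]
     * store the same value (slot), which is why one write port with two
     * simultaneously enabled select lines is enough in hardware. The
     * tail entry stays self-referential: next[tail] == tail.
     */
    static void enqueue(port_queues_t *p, int q, int slot)
    {
        if (!p->valid[q]) {
            p->head[q]  = slot;         /* empty queue: head = slot */
            p->valid[q] = 1;
        } else {
            p->next[p->tail[q]] = slot; /* link old tail to new element */
        }
        p->next[slot] = slot;           /* self-referential tail */
        p->tail[q]    = slot;
    }

Running this against the example above reproduces the tables: enqueueing slot 6 after slot 3 sets Next_buffer[3] = 6 and Next_buffer[6] = 6, with head 3 and tail 6. Because next[tail] always equals tail, a dequeue can detect the last element when next[head] == head, which is presumably what simplifies resetting the Valid cell as noted above.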

2. SRAM implementation: Similar to the data buffer SRAM, with the only difference being the specialized decoder, which is the same as that in the register file implementation of the next pointer table.
