Bidirectional Pipelining for Scalable IP Lookup and Packet Classification
Authors: Weirong Jiang, Hoang Le, Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA
{weirongj, hoangle, prasanna}@usc.edu

ABSTRACT
Both IP lookup and packet classification in IP routers can be implemented by some form of tree traversal. SRAM-based pipelining can improve the throughput dramatically. However, previous pipelining schemes result in unbalanced memory allocation over the pipeline stages. This has been identified as a major challenge for scalable pipelined solutions. This paper proposes a flexible bidirectional linear pipeline architecture based on widely-used dual-port SRAMs. A search tree is partitioned, and then mapped onto pipeline stages by a bidirectional fine-grained mapping scheme. We introduce the notion of inversion factor and several heuristics to invert subtrees for memory balancing. Due to its linear structure, the architecture maintains packet input order, and supports non-blocking route updates. Our experiments show that the architecture can achieve a perfectly balanced memory distribution over the pipeline stages, for both trie-based IP lookup and tree-based multi-dimensional packet classification. For IP lookup, it can store a full backbone routing table with 154419 entries using 2 MB of memory, and sustain a high throughput of 1.87 billion packets per second (GPPS), i.e. 0.6 Tbps for minimum size (40 bytes) packets. The throughput can be further improved to 2.4 Tbps by employing caching to exploit Internet traffic locality.
Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures; C.2.6 [Computer Communication Networks]: Internetworking—Routers

General Terms
Algorithms, Design, Performance

Keywords
Packet classification, IP lookup, Pipeline, Terabit, Bidirectional, SRAM

1. INTRODUCTION
Modern IP routers need to offer not only IP lookup for packet forwarding, but also a variety of value-added functions, such as security and differentiated services. Most of these functions rely on packet classification, where packets are classified into different flows according to a set of pre-defined rules. Packet classification generally refers to multi-field matching. IP lookup can be viewed as one-dimensional packet classification, where the destination IP address of a packet is matched against a set of prefixes. Meanwhile, advances in optical networking technology pose a big challenge to the design of high speed IP routers. Increasing link rates demand that packet processing in IP routers be performed in hardware. For instance, 40 Gbps links require processing one packet every 8 ns, i.e. a throughput of 125 million packets per second (MPPS), for minimum size (40 bytes) packets. Such throughput is impossible using existing software-based solutions for either IP lookup [18] or packet classification [7]. Most hardware-based solutions for high speed packet processing in routers fall into two main categories: ternary content addressable memory (TCAM)-based and dynamic/static random access memory (DRAM/SRAM)-based solutions. Although TCAM-based engines can retrieve results in just one clock cycle, their throughput is limited by the relatively low clock rate of TCAMs. TCAMs are also expensive and offer little flexibility to adapt to new addressing and routing protocols [10].
As shown in Table 1, SRAMs offer better scalability than TCAMs with respect to memory access time, density and power consumption.

Table 1: Comparison of TCAM and SRAM technologies (based on an 18 Mbit chip)

                                      TCAM          SRAM
  Maximum clock rate (MHz)            266 [16]      400 [5, 19]
  Cell size (# of transistors/bit)    16            6
  Power consumption (Watts)           12-15 [24]    ≈ 0.1 [4]

In DRAM/SRAM-based solutions, both IP lookup and packet classification can be implemented by some form of tree traversal. Each packet traverses a search tree in the memory, and retrieves its matched entry when it arrives at a tree leaf. Such a search process needs multiple memory accesses, which is a major drawback of traditional DRAM/SRAM-based solutions. Several researchers have explored pipelining to improve the throughput. A simple pipelining approach is to map each tree level onto a pipeline stage with its own memory and processing logic, so that one packet can be processed every clock cycle. However, this approach results in an unbalanced tree node distribution over the pipeline stages, which has been identified as a dominant issue for pipelined architectures [3]. In an unbalanced pipeline, the "fattest" stage, which stores the largest number of tree nodes, becomes a bottleneck. It adversely affects the overall performance of the pipeline in the following aspects. First, more time is needed to access the larger local memory, which reduces the global clock rate. Second, a fat stage receives many updates, since the number of updates to a stage is proportional to the number of tree nodes stored in that stage. In particular, during intensive route/rule insertion, the fattest stage may suffer memory overflow.
Furthermore, since it is unclear at hardware design time which stage will be the fattest, we must allocate memory of the maximum size for each stage. Such over-provisioning results in memory wastage [1] and excessive power consumption.

To balance the memory distribution across stages, several novel pipeline architectures have been proposed [1, 12, 9]. However, none of them achieves a perfectly balanced memory distribution over stages. Furthermore, due to the non-linear structures they employ, most of them must stall subsequent packets during a route update. We propose an SRAM-based bidirectional linear pipeline architecture for both IP lookup and packet classification in IP routers. This paper makes the following contributions:

• To the best of our knowledge, this work is the first to achieve a perfectly balanced memory allocation over pipeline stages, for both IP lookup and multi-dimensional packet classification. The memory wastage due to over-provisioning is almost zero.

• A bidirectional fine-grained mapping scheme is proposed to realize the above goal. We introduce the notion of inversion factor and several heuristics to invert subtrees for memory balancing.

• A novel bidirectional linear pipeline architecture is presented to enable the above mapping. It maintains the packet input order and supports non-blocking updates.

• Our simulation experiments using real-life data demonstrate that the SRAM-based pipelined architecture is a promising solution for next generation IP routers. The proposed architecture can store a full backbone routing table with 154419 entries using 2 MB of memory. It can sustain a high throughput of 1.87 billion packets per second (GPPS), i.e. 0.6 Tbps for minimum size (40 bytes) packets.

The remainder of this paper is organized as follows. Section 2 reviews the background and related work.
Section 3 discusses memory balancing over pipeline stages and proposes a novel bidirectional fine-grained mapping scheme. Section 4 proposes a corresponding bidirectional linear pipeline architecture. Section 5 conducts simulation experiments to evaluate the performance of our approaches. Section 6 concludes the paper.

2. BACKGROUND

2.1 IP Lookup and Packet Classification

2.1.1 Trie-based IP Lookup
The nature of IP lookup is longest prefix matching (LPM). The most common data structure in algorithmic solutions for performing LPM is some form of trie [18]. A trie is a binary tree in which a prefix is represented by a node. The value of the prefix corresponds to the path from the root of the tree to the node representing the prefix. Branching decisions are made based on the consecutive bits in the prefix. A trie is called a uni-bit trie if only one bit is used to make a branching decision at a time. The prefix set in Figure 1(a) corresponds to the uni-bit trie in Figure 1(b). For example, the prefix "010*" corresponds to the path starting at the root and ending in node P3: first a left turn (0), then a right turn (1), and finally a turn to the left (0). Each trie node contains two fields: the represented prefix and the pointer to the child nodes. By using the optimization called leaf-pushing [21], each node needs only one field: either the pointer to the next-hop address or the pointer to the child nodes. Figure 1(c) shows the leaf-pushed uni-bit trie derived from Figure 1(b).

Figure 1: (a) Prefix set (P1 = 0*, P2 = 000*, P3 = 010*, P4 = 01001*, P5 = 01011*, P6 = 011*, P7 = 110*, P8 = 111*); (b) Uni-bit trie; (c) Leaf-pushed uni-bit trie.
Given a leaf-pushed uni-bit trie, IP lookup is performed by traversing the trie according to the bits in the IP address. When a leaf is reached, the prefix associated with the leaf is the longest matched prefix for that IP address. The time to look up a uni-bit trie is equal to the prefix length. Scanning multiple bits at a time can increase the search speed. Such a trie is called a multi-bit trie, and the number of bits scanned at a time is called the stride. Some optimization schemes [6, 11] have been proposed to build memory-efficient multi-bit tries. For simplicity, we consider only the leaf-pushed uni-bit trie in this paper, though our ideas are applicable to other forms of tries.

2.1.2 Decision Tree based Packet Classification
Multi-dimensional packet classification is one of the fundamental challenges in designing high speed routers. It enables routers to support firewall processing, Quality of Service differentiation, virtual private networks, policy routing, and other value-added services. An IP packet can be classified based on a number of fields in the packet header, such as source/destination IP address, source/destination port number, type of service, type of protocol, etc. Fields are generally specified by ranges. When a packet arrives at a router, its header is compared against a set of rules. Each rule can have one or more fields and their associated values, a priority, and an action to be taken if matched. A packet is considered to match a rule only if it matches all the fields within that rule. Many packet classification algorithms are based on decision trees, which take the geometric view of the packet classification problem. HyperCuts [20] is a representative of such algorithms. At each node of the decision tree, the search space is cut based on the information from one or more fields in the rule.
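As an illustration, lookup on the leaf-pushed uni-bit trie of Figure 1(c) can be sketched as follows. The dict-based node layout and function names are ours, for illustration only, not the paper's memory layout:

```python
# Sketch of longest-prefix matching on the leaf-pushed uni-bit trie of
# Figure 1(c). An internal node maps a bit ("0"/"1") to a child; a leaf
# carries the prefix it represents; None models the null node.

def leaf(p):
    return {"leaf": p}

def build_trie():
    # Prefix set of Figure 1(a):
    # P1=0*, P2=000*, P3=010*, P4=01001*, P5=01011*, P6=011*, P7=110*, P8=111*
    return {
        "0": {
            "0": {"0": leaf("P2"), "1": leaf("P1")},
            "1": {
                "0": {
                    "0": {"0": leaf("P3"), "1": leaf("P4")},
                    "1": {"0": leaf("P3"), "1": leaf("P5")},
                },
                "1": leaf("P6"),
            },
        },
        "1": {"0": None, "1": {"0": leaf("P7"), "1": leaf("P8")}},
    }

def lookup(root, addr_bits):
    """Traverse one bit per step; the leaf reached gives the LPM result."""
    node = root
    for b in addr_bits:
        if node is None or "leaf" in node:
            break
        node = node[b]
    return node["leaf"] if node is not None and "leaf" in node else None
```

For example, `lookup(build_trie(), "01001000")` follows bits 0-1-0-0-1 and reaches the leaf for P4, the longest matching prefix for that address.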
The HyperCuts algorithm allows cutting on multiple fields per step, resulting in a fatter and shorter decision tree. The search algorithm in a HyperCuts tree is simple. When a packet header arrives at the root of the tree, it traverses the decision tree until it finds either a leaf node or a NULL node. A leaf node represents the first matching rule, and a NULL node indicates that no match has been found.

2.2 Memory-Balanced Pipelines
Pipelining can dramatically improve the throughput of tree traversal. A straightforward way to pipeline a tree is to assign each tree level to a different stage, so that a packet can be processed every clock cycle. However, as discussed earlier, this simple pipelining scheme results in unbalanced memory distribution, leading to low throughput and inefficient memory allocation. Basu et al. [3] and Kim et al. [11] both reduce the memory imbalance by using variable strides to minimize the largest trie level. However, even with their schemes, the memory sizes of different stages can still vary widely. As an improvement upon [11], Lu et al. [13] propose a tree-packing heuristic to balance the memory further, but it does not solve the fundamental problem of how to retrieve a node's descendants when they are not allocated in the following stage. Furthermore, a variable stride multi-bit trie is difficult to implement in hardware, especially if incremental updating is needed [3]. Baboescu et al. [1] propose a Ring pipeline architecture for tree-based search engines. The pipeline stages are configured as a circular, multi-point access pipeline so that the search can be initiated at any stage. A tree is split into many small subtrees of equal size. These subtrees are then mapped to different stages to create an almost balanced pipeline. Some subtrees must wrap around if their roots are mapped to the last several stages.
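The HyperCuts-style search step can be sketched as follows. The node layout, field names, and index arithmetic here are our simplifying assumptions (equal-sized cuts per field), not the paper's or HyperCuts' exact data structure:

```python
# Illustrative sketch of decision-tree packet classification in the
# HyperCuts style. Each internal node cuts its search space into
# equal-sized partitions along one or more header fields; the child
# index is computed from the packet header values.

def child_index(node, header):
    """Combine per-field partition indices into a single child slot."""
    idx = 0
    for field, lo, hi, cuts in node["cuts"]:      # cuts along chosen fields
        width = (hi - lo + 1) // cuts
        idx = idx * cuts + min((header[field] - lo) // width, cuts - 1)
    return idx

def classify(root, header):
    node = root
    while node is not None and "rules" not in node:
        node = node["children"][child_index(node, header)]
    if node is None:
        return None                                # NULL node: no match
    # Leaf: linear search over its small rule list, highest priority first.
    for rule in node["rules"]:
        if all(lo <= header[f] <= hi for f, lo, hi in rule["match"]):
            return rule["id"]
    return None
```

A leaf thus reports the first (highest-priority) matching rule, while falling off into a NULL child reports no match, mirroring the search behavior described above.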
Any incoming IP packet needs to look up an index table to find its corresponding subtree's root, which is the starting point of that search. Since the subtrees may come from different depths of the original tree, a constant number of address bits cannot be used to index the table. Thus, the index table must be built with content addressable memories (CAMs), which may result in lower speed. Though all IP packets enter the pipeline from the first stage, their lookup processes may be activated at different stages. All packets must traverse the pipeline twice to complete the tree traversal. The throughput is thus 0.5 packets per clock cycle. Kumar et al. [12] extend the circular pipeline with a new architecture called Circular, Adaptive and Monotonic Pipeline (CAMP). It uses several initial bits (i.e. an initial stride) as the hashing index to partition the tree. Using a similar idea but a different mapping algorithm from the Ring [1], CAMP also creates an almost balanced pipeline. Unlike the Ring pipeline, CAMP has multiple entry and exit stages. To manage the access conflicts between packets from the current and preceding stages, several queues are employed. Since different packets of an input stream may have different entry and exit stages, the ordering of the packet stream is lost when passing through CAMP. Assuming the packets traverse all the stages, when the packet arrival rate exceeds 0.8 packets per clock cycle, some packets may be discarded [12]. In other words, the worst-case throughput is 0.8 packets per clock cycle. Also, in CAMP, a queue adds extra delay for each packet, which may result in out-of-order output and delay variation. Due to their non-linear structure, neither the Ring pipeline nor CAMP can maintain a worst-case throughput of one packet per clock cycle.
Also, neither of them supports non-blocking route updates, since an ongoing update may conflict with preceding or following packets. Our previous work [9] adopts an optimized linear pipeline architecture, named OLP, to achieve a high throughput of one output per clock cycle, while supporting write bubbles [3] for non-blocking updates. By adding nops (no-operations) in the pipeline, OLP offers more freedom in mapping tree nodes to pipeline stages. The tree is partitioned, and all subtrees are converted into queues and mapped onto the pipeline from the first stage. However, in OLP, the first several stages may not be balanced, since the top levels of a tree have few nodes.

2.3 Discussion
State-of-the-art techniques cannot achieve perfectly balanced memory distribution, due to several constraints they place during mapping: (1) they require the tree nodes on the same level to be mapped onto the same stage; (2) the mapping scheme is uni-directional: the subtrees partitioned from the original tree must be mapped in the same direction (either from the root, or from the leaves). Actually, both constraints are unnecessary. The only constraint we must obey is:

Constraint 1: If node A is an ancestor of node B in a tree, then A must be mapped to a stage preceding the stage to which B is mapped.

This paper proposes a flexible bidirectional linear pipeline architecture which provides a unified architecture for both IP lookup and packet classification. By employing widely-used dual-port SRAMs, the architecture allows two flows from opposite directions to access the local memory in a stage at the same time. With a bidirectional fine-grained mapping scheme, a perfectly balanced memory distribution over pipeline stages is achieved.
It has many desirable properties due to its linear structure: (1) the worst-case throughput of one packet per clock cycle is sustained; (2) each packet has a constant delay to go through the architecture; (3) the packet input order is maintained; (4) non-blocking update is supported, that is, while a write bubble is inserted to update the stages, both the subsequent and the antecedent packets can perform their searches as well.

Figure 2: Level-by-level mapping of routing tries (rrc00, rrc01, rrc08, rrc11) onto 32 pipeline stages. (a) Depth-based mapping; (b) Height-based mapping.

Figure 3: Level-by-level mapping of decision trees (fw1_100, ipc1_1k, acl1_10k, fw1_real) onto 25 pipeline stages. (a) Depth-based mapping; (b) Height-based mapping.

3. MEMORY BALANCING
This section studies the problem of balancing the memory distribution across pipeline stages. We examine two typical mapping approaches, and then propose a novel bidirectional fine-grained mapping scheme. First, we define the following terms.

Definition 1. The pipeline depth is the number of pipeline stages.

Definition 2. The depth of a tree node is the directed distance from the tree node to the tree root. The depth of a tree refers to the maximum depth of all tree leaves.

Definition 3. The height of a tree node is the maximum directed distance from the tree node to a leaf node.
The height of a tree refers to the height of the root. In fact, the depth of a tree is equal to its height.

Definition 4. In depth-based (height-based) mapping, two tree nodes are said to be on the same level if they have the same depth (height).

3.1 Motivation
The most straightforward mapping scheme is depth-based mapping [3], where the tree nodes with the same depth are mapped onto the same stage. In this scheme, the first stage always has exactly one tree node, i.e. the tree root, and all packets enter the pipeline from the first stage. Another level-by-level mapping scheme is height-based mapping [8], where the tree nodes with the same height are mapped onto the same stage. In this scheme, all tree leaves are mapped onto the first stage, the tree root is mapped onto the last stage, and all packets enter the pipeline from the last stage. We study the effectiveness of the above two mapping schemes by conducting experiments on four representative routing tables, rrc00, rrc01, rrc08 and rrc11, collected from [17]. We also collected four rule sets, fw1_100, ipc1_1k, acl1_10k and fw1_real, from [15] and built them into decision trees using the HyperCuts algorithm [20]. According to Figures 2(a-b) and Figures 3(a-b), for both trie-based IP lookup and decision tree based packet classification, the node distribution across the stages is extremely unbalanced under either the depth-based or the height-based mapping.

Figure 4: Bidirectional fine-grained mapping for the trie in Figure 1. (a) Partition; (b) Invert; (c) Node-to-Stage Mapping.
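The two level-by-level mappings can be contrasted with a small sketch. The nested-dict tree representation and helper names are ours, not the paper's:

```python
# Sketch contrasting depth-based and height-based level-by-level mapping.
# A tree node is a dict mapping "0"/"1" to children; {} is a leaf.

def children(node):
    return [c for c in node.values() if c is not None]

def counts_by_depth(root):
    """Stage i holds all nodes at depth i (depth-based mapping)."""
    counts, frontier = [], [root]
    while frontier:
        counts.append(len(frontier))
        frontier = [c for n in frontier for c in children(n)]
    return counts

def height(node):
    ch = children(node)
    return 0 if not ch else 1 + max(height(c) for c in ch)

def counts_by_height(root):
    """Stage i holds all nodes at height i (height-based mapping)."""
    counts = {}
    def visit(node):
        h = height(node)
        counts[h] = counts.get(h, 0) + 1
        for c in children(node):
            visit(c)
    visit(root)
    return [counts.get(h, 0) for h in range(max(counts) + 1)]
```

Even on a tiny unbalanced tree, depth-based mapping leaves a single node (the root) in the first stage, while height-based mapping crowds all leaves there; neither distribution is balanced.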
We call the depth-based mapping forward mapping, since the mapping begins from the root, and the height-based mapping reverse mapping, since it begins from the leaves. Intuitively, the result will be more balanced if the two mapping schemes can be combined in an effective way.

3.2 Bidirectional Fine-grained Mapping
To achieve a perfectly balanced memory distribution over the stages, we propose a bidirectional fine-grained mapping scheme, as shown in Figure 4. The main ideas are (1) fine-grained mapping: allowing two trie nodes on the same trie level to be mapped onto different stages; and (2) bidirectional mapping: allowing two subtrees to be mapped in different directions. However, several issues must be addressed:

• Partition the entire tree so that we have several subtrees to be mapped in different directions.

• Decide which subtree(s) should be inverted and mapped in the reverse direction.

• Adapt and combine the depth-based and the height-based mapping schemes at each step.

3.2.1 Tree Partitioning
We use prefix expansion [21] to partition the tree. Several initial bits are used as the index to partition the tree into many disjoint subtrees. According to [2] and our observation on the collected routing tables, few prefixes in real-life routing tables are shorter than 16 bits. Hence, there will be little prefix duplication when we use fewer than 16 initial bits to expand the prefixes.

3.2.2 Subtree Inversion
In a trie, there are few nodes at the top levels while there are many nodes at the leaf level. Hence, we can invert some subtrees so that their leaf nodes are mapped onto the first several stages. We propose several heuristics to select the subtrees to be inverted:

1. Largest leaf: The subtree with the largest number of leaves is preferred. This is straightforward, since we need enough nodes to be mapped onto the first several stages.
(For simplicity, in this section we describe our scheme for the trie only. The scheme can be easily extended to the decision tree.)

2. Least height: The subtree of shortest height is preferred. Due to Constraint 1, a subtree with a larger height has less flexibility to be mapped onto pipeline stages.

3. Largest leaf per height: This is a combination of the previous two heuristics, dividing the number of leaves of a subtree by its height.

4. Least average depth per leaf: Average depth per leaf is the ratio of the sum of the depths of all the leaves to the number of leaves. This heuristic prefers a more balanced subtree.

Algorithm 1 finds the subtrees to be inverted, where IFR denotes the inversion factor. A larger inversion factor results in more subtrees being inverted. When the inversion factor is 0, no subtree is inverted. When the inversion factor is close to the pipeline depth, all subtrees are inverted. The complexity of this algorithm is O(K), where K denotes the total number of subtrees.

Algorithm 1 Selecting the subtrees to be inverted
Input: K subtrees.
Output: V subtrees to be inverted.
1: N = total # of tree nodes of all subtrees, H = # of pipeline stages, V = 0, W = K.
2: while V < K AND W < IFR × ⌈N/H⌉ do
3:   Based on the chosen heuristic, select one subtree from those not inverted.
4:   V = V + 1, W = W − 1 + # of leaves of the selected subtree.
5: end while

3.2.3 Mapping Algorithm
Now we have two sets of subtrees. The subtrees which are mapped from their roots are called the forward subtrees, while the others are called the reverse subtrees. We use a bidirectional fine-grained mapping algorithm (Algorithm 2). The nodes are popped out of the ReadyList in decreasing order of their priority. The priority of a trie node is defined as its height if the node is in a forward subtree, and as its depth if it is in a reverse subtree.
A node whose priority is equal to the number of remaining stages is regarded as a critical node. If such a node is not mapped onto the current stage, some of its descendants (if in a forward subtree) or ancestors (if in a reverse subtree) cannot be mapped later. For the forward subtrees, a node is pushed into the NextReadyList immediately after its parent is popped. For the reverse subtrees, a node is not pushed into the NextReadyList until all its children have been popped. The complexity of this mapping algorithm is O(HN), where H denotes the pipeline depth and N the total number of trie nodes.

Algorithm 2 Bidirectional fine-grained mapping
Input: K forward subtrees and V reverse subtrees.
Output: H stages with mapped nodes.
1: Create and initialize two lists: ReadyList = φ and NextReadyList = φ.
2: Rn = # of remaining nodes, Rh = # of remaining stages = H.
3: Push the roots of the forward subtrees and the leaves of the reverse subtrees into ReadyList.
4: for i = 1 to H do
5:   Mi = 0, Critical = FALSE.
6:   Sort the nodes in ReadyList in decreasing order of node priority.
7:   while Critical = TRUE or (Mi < ⌈Rn/Rh⌉ and ReadyList ≠ φ) do
8:     Pop a node from ReadyList, map it into Stage i, and set Mi = Mi + 1.
9:     if the node is in a forward subtree then
10:      The popped node's children are pushed into NextReadyList.
11:    else if all children of the popped node's parent have been mapped then
12:      The popped node's parent is pushed into NextReadyList.
13:    end if
14:    Critical = FALSE.
15:    if there exists a node Nc ∈ ReadyList with priority(Nc) ≥ Rh − 1 then
16:      Critical = TRUE.
17:    end if
18:  end while
19:  Rn = Rn − Mi, Rh = Rh − 1.
20:  Merge the NextReadyList into the ReadyList.
21: end for

The effectiveness of the bidirectional mapping scheme is evaluated in Section 5.
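The core loop of Algorithm 2 can be sketched in simplified form as follows. The node structure is our simplification, not the paper's implementation: each node carries a precomputed priority, a "waiting" count of unmapped prerequisites, and a "succ" list of nodes it helps unlock (its children in a forward subtree; its parent, waiting for all children, in a reverse subtree):

```python
import math

def map_to_stages(initial_ready, total_nodes, H):
    """Simplified sketch of Algorithm 2. Each node is a dict with "name",
    "priority" (height in a forward subtree, depth in a reverse one),
    "waiting" (# of unmapped prerequisites) and optional "succ"."""
    ready = list(initial_ready)
    stages = [[] for _ in range(H)]
    rn, rh = total_nodes, H                    # remaining nodes / stages
    for i in range(H):
        ready.sort(key=lambda n: n["priority"], reverse=True)
        next_ready, quota = [], math.ceil(rn / rh)
        # Map up to the per-stage quota, but always map a critical node
        # (priority >= remaining stages after this one) immediately.
        while ready and (len(stages[i]) < quota
                         or ready[0]["priority"] >= rh - 1):
            node = ready.pop(0)
            stages[i].append(node["name"])
            for succ in node.get("succ", []):
                succ["waiting"] -= 1
                if succ["waiting"] == 0:       # all prerequisites mapped
                    next_ready.append(succ)
        rn, rh = rn - len(stages[i]), rh - 1
        ready += next_ready                    # merge NextReadyList
    return stages
```

For a single forward subtree with root A, children B and C, and B's child D, mapping onto 3 stages respects Constraint 1 while filling each stage close to the quota ⌈Rn/Rh⌉.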
4. HARDWARE ARCHITECTURE
To enable the bidirectional fine-grained mapping scheme, we develop a bidirectional linear pipeline architecture based on dual-port SRAMs, as shown in Figure 5. (Dual-port SRAMs have been standard components in many devices such as FPGAs [22].)

4.1 Overview
There is one Direction Index Table (DIT), which stores the relationship between the subtrees and their mapping directions: forward or reverse. For any arriving packet p, the initial bits of its IP address are used to look up the DIT and retrieve the information about its corresponding subtree ST(p). The information includes (1) the distance to the stage where the root of ST(p) is stored, (2) the memory address of the root of ST(p) in that stage, and (3) the mapping direction of ST(p), which leads the packet to the corresponding entrance of the pipeline. For example, in Figure 5, if the mapping direction is forward, the packet is sent to the leftmost stage of the pipeline; otherwise, it is sent to the rightmost stage.

Once its direction is known, the packet goes through the entire pipeline in that direction. The pipeline is configured as a dual-entrance bidirectional linear pipeline. At each stage, the memory has dual Read/Write ports so that packets from both directions can access the memory simultaneously. The content of each entry in the memory includes (1) the memory address of the child node and (2) the distance to the stage where the child node is stored. If the distance value is zero, the memory address of the child node is used to index the memory in the next stage and retrieve the child node's content. Otherwise, the packet passes through the stage without any operation except decrementing its distance value by one.

4.2 Incremental Route Updates
We update the memory in the pipeline by inserting write bubbles [3].
The new content of the memory is computed offline. When an update is initiated, a write bubble is inserted into the pipeline. The direction of write bubble insertion is determined by the direction of the subtree that the write bubble is going to update. Each write bubble is assigned an ID. There is one write bubble table in each stage, which stores the update information associated with the write bubble ID. When it arrives at the stage prior to the stage to be updated, the write bubble uses its ID to look up the write bubble table. It then retrieves (1) the memory address to be updated in the next stage, (2) the new content for that memory location, and (3) a write enable bit. If the write enable bit is set, the write bubble uses the new content to update the memory location in the next stage. Since the subtrees mapped onto the two directions are disjoint, a write bubble inserted from one direction will not contaminate the memory content for the search from the other direction. Also, since the pipeline is linear, all packets preceding or following the write bubble can perform their searches while the write bubble performs an update.

4.3 Throughput Improvement by Caching
In the architecture shown in Figure 5, at most two packets are allowed to enter the pipeline at the same time. The throughput can be 2 packets per clock cycle (PPC) only if the two packets are in the two distinct directions. Usually, such traffic balancing cannot be guaranteed in reality. Thus, the throughput is lower than 2 PPC when we insert 2 packets in one clock cycle. On the other hand, Internet traffic contains a great amount of locality due to the TCP mechanism and application characteristics [10]. As shown in Figure 6, small caches can be added into the architecture to exploit the Internet traffic locality. The most recently searched packets are cached.
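Such a cache can be sketched as a fully-associative LRU table in front of the pipeline. The class and method names are ours; in hardware this would be a fixed-size CAM/SRAM structure, not a Python dict:

```python
from collections import OrderedDict

class LookupCache:
    """Illustrative fully-associative LRU cache in front of the pipeline."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key: dest IP (or 5-tuple) -> result

    def get(self, key):
        if key not in self.entries:
            return None                # miss: packet must traverse the pipeline
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, result):
        # Called after a missed packet retrieves its result from the pipeline.
        self.entries[key] = result
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

    def invalidate(self, key):
        # Called on a route update related to a cached entry.
        self.entries.pop(key, None)
```

On a hit, the packet skips the pipeline entirely; on a miss, it traverses the pipeline and its result is inserted, evicting the least recently used entry when the cache is full.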
Any arriving packet accesses the cache first. If a cache hit occurs, the packet skips traversing the pipeline; otherwise, it goes through the pipeline. The cache can be organized with any associativity; we use full associativity as the default. For IP lookup, only the destination IP of the packet is used to index the cache, while for packet classification, multiple fields of the packet header may be used. A cache update is triggered either when there is a route update related to some cached entry, or after a packet that previously had a cache miss retrieves its search result from the pipeline. Any replacement algorithm can be used to update the cache; the Least Recently Used (LRU) algorithm is used as the default.

Figure 5: Block diagram of the basic architecture.

Figure 6: Block diagram of the enhanced architecture.

5. PERFORMANCE EVALUATION
This section evaluates the effectiveness of the proposed scheme and the performance of the proposed architecture. First, we examine the memory balancing achieved by the bidirectional fine-grained mapping scheme. Then, we measure the throughput using real-life traffic traces. All experiments are based on simulation.

5.1 Memory Balancing
First, we conducted experiments on the four routing tables given in Section 3.1 to examine the effectiveness of the bidirectional fine-grained mapping scheme. We used various inversion heuristics and inversion factors to evaluate their impacts.
In these experiments, the number of initial bits used for partitioning the trie is 12. Then, with appropriate parameter settings, we conducted experiments on the four 5-tuple rule sets to verify the effectiveness of our scheme for decision trees.

5.1.1 Impact of the inversion heuristics

As discussed in Section 3.2.2, we have four different heuristics for inverting subtrees. We now examine their performance; the results are shown in Figure 7. The value of the inversion factor is set to 1. According to the results, the least average depth per leaf heuristic has the best performance. This shows that, when we have a choice, a balanced subtree should be inverted. The explanation is that a balanced subtree has many nodes not only at the leaf level but also at the lower levels, which helps balance not only the first stage but also the first several stages.

5.1.2 Impact of the inversion factor

Using the largest leaf heuristic, we varied the value of the inversion factor. The results are shown in Figure 8. When the inversion factor is 0, the bidirectional mapping degenerates to fine-grained forward mapping only. The mapping becomes fine-grained reverse mapping when the inversion factor approaches the pipeline depth, so that all subtrees are inverted.

5.1.3 Short Summary

According to the above results for trie-based IP lookup, the bidirectional fine-grained mapping scheme can achieve a perfectly balanced memory distribution over the pipeline stages, by either using an appropriate inversion heuristic or adopting an appropriate inversion factor. This also shows that the architecture is flexible: it offers a large design space for adapting to different routing tables with various prefix distributions. In fact, we conducted further experiments on 16 routing tables collected from [17] and obtained results similar to those presented here.
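The winning criterion from Section 5.1.1, least average depth per leaf, is easy to state in code. The sketch below is an illustration of the metric only, not the authors' implementation; the tuple-based binary tree encoding is an assumption:

```python
# Sketch of the "least average depth per leaf" criterion: among candidate
# subtrees, prefer inverting the one whose leaves are shallowest on average.
# A binary tree is encoded as nested (left, right) tuples, None = no child.

def leaf_depths(node, depth=0):
    """Yield the depth of every leaf in the tree."""
    if node is None:
        return
    left, right = node
    if left is None and right is None:
        yield depth                       # this node is a leaf
        return
    yield from leaf_depths(left, depth + 1)
    yield from leaf_depths(right, depth + 1)

def avg_depth_per_leaf(subtree):
    depths = list(leaf_depths(subtree))
    return sum(depths) / len(depths)

# A balanced subtree scores lower than a skewed one with the same number
# of leaves, so it is the one chosen for inversion.
leaf = (None, None)
balanced = ((leaf, leaf), (leaf, leaf))   # four leaves, all at depth 2
n3 = (leaf, leaf)
skewed = ((n3, leaf), leaf)               # leaves at depths 3, 3, 2, 1
print(avg_depth_per_leaf(balanced))       # -> 2.0
print(avg_depth_per_leaf(skewed))         # -> 2.25
```

This matches the intuition in the text: the balanced subtree contributes nodes across several of the first stages after inversion, not just at the leaf level.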
5.1.4 Applying the Scheme to Decision Trees

Figure 9: Bidirectional fine-grained mapping for decision trees (rule sets fw1_100, ipc1_1k, acl1_10k, fw1_real; largest leaf heuristic; inversion factor = 1).

Figure 7: Bidirectional fine-grained mapping with different heuristics (inversion factor = 1): (a) largest leaf; (b) least height; (c) largest leaf per height; (d) least average depth per leaf.

Comparing Figures 2(a-b) with Figures 3(a-b), we find that the characteristics of the decision trees are distinct from those of the routing tries. The node distribution of the decision trees after the depth-based mapping differs somewhat from that of the routing tries: there are quite a lot of nodes in the first several stages, so only a few subtrees can be inverted in the bidirectional mapping. Also, in HyperCuts, each step from a node to its children is a multi-dimensional cut rather than a bit scan. Hence, we cannot use prefix expansion to partition the tree; instead, we use only the first cut to partition the tree. Figure 9 shows the results for the bidirectional mapping of the decision trees. Compared to Figure 7(a), which uses the same setting but does not achieve a balanced distribution, Figure 9 exhibits a perfectly balanced node distribution over the stages.
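The first-cut partitioning just described can be summarized in a few lines. This is a minimal sketch under an assumed node representation (HyperCuts nodes carry a list of children produced by their cut); it is not the authors' data structure:

```python
# Sketch: for HyperCuts, the tree is partitioned at the root's first
# multi-dimensional cut, not by prefix expansion.

class HCNode:
    def __init__(self, children=(), rules=()):
        self.children = list(children)  # child nodes produced by this node's cut
        self.rules = list(rules)        # rules stored at a leaf

def partition_by_first_cut(root):
    """Each child of the root's cut becomes one subtree, to be mapped
    forward or inverted onto the pipeline stages."""
    return list(root.children)

root = HCNode(children=[HCNode(rules=["r1"]),
                        HCNode(rules=["r2", "r3"]),
                        HCNode(rules=["r4"])])
subtrees = partition_by_first_cut(root)
print(len(subtrees))  # -> 3
```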
5.2 Throughput Improvement

We used real-life Internet traffic traces to evaluate the throughput performance of the proposed architecture. Two anonymized real-life traces were collected from [14]; their information is listed in Table 2. Due to the unavailability of public IP traces associated with their corresponding routing tables, we generated the routing tables by extracting the unique destination IP addresses from the traces.

Table 2: IP header traces

  Trace                        # of packets    # of IPs
  APTH: AMP-1110523221-1       769100          17628
  IPLS: I2A-1091235138-1       1821364         15791

The major parameters of the architecture include the input width, i.e. the number of parallel inputs, denoted P; the pipeline depth, denoted H; the queue size, i.e. the maximum number of packets allowed to be stored in a queue, denoted Q; and the cache size, i.e. the maximum number of packets allowed to be cached, denoted C. In these experiments, the default setting for the architecture parameters was P = 4, H = 25, Q = 2, C = 160. The performance metric is the throughput in terms of the number of packets processed per clock cycle (PPC). Note that in a P-width architecture, the throughput ≤ P.

5.2.1 Impact of the input width

We increased the input width and observed the throughput scalability. Figure 10 shows that, with caching, the throughput scaled well with the input width, especially when P ≤ 4.
Figure 8: Bidirectional fine-grained mapping with various inversion factors (largest leaf heuristic): (a) inversion factor = 0; (b) inversion factor = 4; (c) inversion factor = 8; (d) inversion factor = 12.

5.2.2 Impact of the cache size and the queue size

We evaluated the impact of the cache size and the queue size, respectively, on the throughput. The results are shown in Figures 11 and 12. Caching is efficient in improving the throughput: with only 1% of the routing entries being cached, the throughput reached almost 4 PPC in a 4-width architecture. On the other hand, the queue size had little effect on the throughput improvement. A small queue with Q = 16 is enough for the 4-width architecture.

5.3 Overall Performance

Based on the previous experiments, we estimate the overall performance of a 4-width 25-stage architecture. As Figure 8(b) shows, for the largest backbone routing table rrc11 with 154419 prefixes, each stage has fewer than 32K nodes. A 15-bit address is enough to index a node in the local memory of a stage. Since the pipeline depth is 25, we need an extra 5 bits to specify the distance. Thus, each node stored in the local memory needs 20 bits. The total memory needed for storing rrc11 in a 25-stage architecture is 20 × 2^15 × 25 = 16 Mb = 2 MB, where each stage needs 80 KB of memory.
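The memory sizing above can be re-derived mechanically; the constants below are the values stated in the text:

```python
# Re-deriving the Section 5.3 memory figures for rrc11.
NODES_PER_STAGE = 2 ** 15    # each stage holds fewer than 32K nodes
ADDR_BITS = 15               # enough to index a node in a stage's local memory
PIPELINE_DEPTH = 25
DISTANCE_BITS = 5            # extra bits to specify the distance (depth = 25)
NODE_BITS = ADDR_BITS + DISTANCE_BITS      # 20 bits per stored node

stage_bits = NODE_BITS * NODES_PER_STAGE   # bits of SRAM per stage
total_bits = stage_bits * PIPELINE_DEPTH   # bits for the whole pipeline
print(stage_bits // 8 // 1024)   # -> 80 (KB per stage)
print(total_bits / 8 / 10**6)    # -> 2.048 (the paper rounds to 2 MB)
```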
We use CACTI 4.2 [4] to estimate the memory access time and the power consumption. An 80 KB dual-port SRAM in 45 nm technology needs 0.53 ns per access and dissipates 0.01 W of power. The maximum clock rate of the above architecture in an ASIC implementation can thus be 1.87 GHz. Considering the throughput of 4 PPC shown in Figure 10, the overall throughput can be as high as 4 × 1.87 ≈ 7.5 billion packets per second, i.e. 2.4 Tbps for the minimum packet size of 40 bytes. Such a throughput is 14 times that of the state-of-the-art TCAM-based IP lookup engines [24]. The overall power consumption is 0.25 W, which is only one eighth of that of the "coolest" TCAM solution [23].

Figure 10: Throughput vs. input width (H = 25, Q = 2, C = 160).

Figure 11: Throughput vs. cache size (P = 4, H = 25, Q = 2).

6. CONCLUSIONS AND FUTURE WORK

This paper proposed a flexible dual-port SRAM based bidirectional linear pipeline architecture for scalable IP lookup and packet classification in IP routers. By using a bidirectional fine-grained mapping scheme, the tree nodes can be evenly distributed over the pipeline stages. Due to its linear structure, the architecture preserves the packet input order and supports non-blocking route updates. Using 2 MB of memory to store a core routing table with over 150K entries, the architecture can sustain a high throughput of 0.6 Tbps, and can further achieve 2.4 Tbps by employing caching. For multi-dimensional packet classification, the operations in each stage are more complex than for simple trie-based IP lookup, which may adversely affect the pipeline performance.
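As a quick sanity check, the headline throughput arithmetic of Section 5.3 can be reproduced in a few lines; all constants are the values stated in the text:

```python
# Checking the Section 5.3 throughput numbers.
CLOCK_GHZ = 1.87        # max ASIC clock rate implied by the CACTI access time
PPC = 4                 # packets per clock cycle with caching, P = 4
MIN_PKT_BITS = 40 * 8   # minimum packet size of 40 bytes

gpps = CLOCK_GHZ * PPC             # billion packets per second (~7.5)
tbps = gpps * MIN_PKT_BITS / 1e3   # Gbit/s -> Tbit/s (~2.4)
print(gpps, tbps)
```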
We plan to develop new search data structures for packet classification so that pipelining becomes more feasible.

Figure 12: Throughput vs. queue size (P = 4, H = 25, C = 160).

7. REFERENCES

[1] F. Baboescu, D. M. Tullsen, G. Rosu, and S. Singh. A tree based router search engine architecture with single port memories. In Proc. ISCA, pages 123–133, 2005.
[2] F. Baboescu and G. Varghese. Scalable packet classification. In Proc. SIGCOMM, pages 199–210, 2001.
[3] A. Basu and G. Narlikar. Fast incremental updates for pipelined forwarding engines. In Proc. INFOCOM, pages 64–74, 2003.
[4] CACTI 4.2. http://quid.hpl.hp.com:9081/cacti/.
[5] Cypress Sync SRAMs. http://www.cypress.com.
[6] W. Eatherton, G. Varghese, and Z. Dittia. Tree bitmap: hardware/software IP lookups with incremental updates. SIGCOMM Comput. Commun. Rev., 34(2):97–122, 2004.
[7] P. Gupta and N. McKeown. Algorithms for packet classification. IEEE Network, 15(2):24–32, 2001.
[8] J. Hasan and T. N. Vijaykumar. Dynamic pipelining: making IP-lookup truly scalable. In Proc. SIGCOMM, pages 205–216, 2005.
[9] W. Jiang and V. K. Prasanna. A memory-balanced linear pipeline architecture for trie-based IP lookup. In Proc. Hot Interconnects (HotI '07), pages 83–90, 2007.
[10] W. Jiang, Q. Wang, and V. K. Prasanna. Beyond TCAMs: An SRAM-based parallel multi-pipeline architecture for terabit IP lookup. In Proc. INFOCOM, 2008.
[11] K. S. Kim and S. Sahni. Efficient construction of pipelined multibit-trie router-tables. IEEE Trans. Comput., 56(1):32–43, 2007.
[12] S. Kumar, M. Becchi, P. Crowley, and J. Turner. CAMP: fast and efficient IP lookup architecture. In Proc. ANCS, pages 51–60, 2006.
[13] W. Lu and S. Sahni. Packet forwarding using pipelined multibit tries. In Proc.
ISCC, 2006.
[14] NLANR network traffic packet header traces. http://pma.nlanr.net/traces/.
[15] Packet Classification Filter Sets. http://www.arl.wustl.edu/~hs1/pclasseval.html#3.filtersets.
[16] Renesas CAM ASSP Series. http://www.renesas.com.
[17] RIS Raw Data. http://data.ris.ripe.net.
[18] M. A. Ruiz-Sanchez, E. W. Biersack, and W. Dabbous. Survey and taxonomy of IP address lookup algorithms. IEEE Network, 15(2):8–23, 2001.
[19] SAMSUNG High Speed SRAMs. http://www.samsung.com.
[20] S. Singh, F. Baboescu, G. Varghese, and J. Wang. Packet classification using multidimensional cutting. In Proc. SIGCOMM, pages 213–224, 2003.
[21] V. Srinivasan and G. Varghese. Fast address lookups using controlled prefix expansion. ACM Trans. Comput. Syst., 17:1–40, 1999.
[22] Xilinx Virtex FPGAs. http://www.xilinx.com.
[23] F. Zane, G. J. Narlikar, and A. Basu. CoolCAMs: Power-efficient TCAMs for forwarding engines. In Proc. INFOCOM, pages 42–52, 2003.
[24] K. Zheng, C. Hu, H. Lu, and B. Liu. A TCAM-based distributed parallel IP lookup scheme and performance analysis. IEEE/ACM Trans. Netw., 14(4):863–875, 2006.