Selecting Microarchitecture Configuration of Processors for Internet of Things

Prasanna Kansakar, Student Member, IEEE and Arslan Munir, Senior Member, IEEE

The authors are with the Department of Computer Science, Kansas State University, Manhattan, KS (e-mail: {pkansakar@ksu.edu, amunir@ksu.edu}).

Abstract—The Internet of Things (IoT) makes use of ubiquitous internet connectivity to form a network of everyday physical objects for purposes of automation, remote data sensing and centralized management/control. IoT objects need to be embedded with processing capabilities to fulfill these services. The design of processing units for IoT objects is constrained by various stringent requirements, such as performance, power, thermal dissipation etc. In order to meet these diverse requirements, a multitude of processor design parameters need to be tuned accordingly. In this paper, we propose a temporally efficient design space exploration methodology which determines power and performance optimized microarchitecture configurations. We also discuss the possible combinations of these microarchitecture configurations to form an effective two-tiered heterogeneous processor for IoT applications. We evaluate our design space exploration methodology using a cycle-accurate simulator (ESESC) and a standard set of PARSEC and SPLASH2 benchmarks. The results show that our methodology determines microarchitecture configurations which are within 2.23%–3.69% of the configurations obtained from fully exhaustive exploration while only exploring 3%–5% of the design space. Our methodology achieves on average 24.16× speedup in design space exploration as compared to fully exhaustive exploration in finding power and performance optimized microarchitecture configurations for processors.

Index Terms—Internet of Things (IoT), design space exploration, microarchitecture, tunable processor parameters, cycle-accurate simulator (ESESC), PARSEC and SPLASH2 benchmarks

I. INTRODUCTION AND MOTIVATION

The internet has grown rapidly in both enterprise and consumer markets. This has given rise to the Internet of Things (IoT) wherein everyday physical objects are interconnected through a communication network for purposes of automation, remote data sensing and centralized management/control. The IoT creates an intelligent, invisible network fabric that can be sensed, controlled and programmed, which allows objects in the IoT ecosystem to communicate, directly or indirectly, with each other or the Internet [1]. The “things”, in the scope of IoT, are IoT enabled objects containing sensing and actuating elements along with embedded hardware and software components which facilitate data aggregation, network connectivity and security. Each IoT enabled object is designed to perform an application specific task using data gathered by itself or using information made available to it through other objects in the network.

There has been widespread deployment of IoT objects in recent years in various applications like healthcare, industry, transportation etc. It is estimated that 6.4 billion connected end-devices are in use in the year 2016 [2], with the number expected to rise to 26 billion by the year 2020 [1]. The massive deployment of IoT objects results in generation of large volumes of data. Data communication, processing, real-time analysis and security of such large volumes of data are important issues that need to be resolved for efficient growth of the IoT ecosystem in the years to come.

In the current IoT model, IoT end-devices are designed to be as simple and as cost effective as possible.
Thus, they are designed with limited processing capabilities, just enough to securely connect and offload data to the cloud. Almost all complex data management functionalities such as data filtering and analysis are delegated to cloud datacenters, the core units of the IoT model. With the growth in data volume in the IoT ecosystem, several significant challenges arise that render this model infeasible. We list here three such challenges.
• Network Overload - Core network bandwidth is a vital resource in the IoT ecosystem which must be used efficiently. With an ever increasing number of IoT objects relaying data over the core network to the cloud, the network becomes severely overloaded. Network overloads introduce latency in critical data processing operations, which impacts IoT applications, such as healthcare and transportation, that require real-time data processing.
• Data security - Data communication in the IoT ecosystem mostly occurs over the public network infrastructure. In order to ensure secure data communication, several complex security protocols must be applied to the data. The volume of data requiring security increases as the number of IoT objects deployed in the IoT ecosystem increases. Applying complex security protocols to large volumes of data requires extensive computing operations which cannot be matched by the energy budget of IoT objects.
• Upgradability - As the IoT landscape continues to evolve, it becomes necessary to upgrade IoT deployments at frequent intervals. IoT objects must be designed to support hassle free addition of new features via remote access. In an ideal IoT model, IoT objects must be able to upgrade to new, more complex features without deployment of new IoT objects and without any direct human involvement.
With limited processing ability, addition of new features to existing IoT objects may be challenging or even infeasible.

The challenges posed by the current IoT model can be overcome by adding processing capabilities inside or local to IoT objects [3]. With the added processing units, data management operations such as filtering and analysis can be carried out within the local network. IoT objects can thus communicate summaries of information, obtained from filtering the aggregated data, to the cloud. This contributes significantly to freeing up the core network bandwidth. The reduction in data volume also reduces the energy expenditure on data security, as less data requires fewer computing operations to secure. Having more processing ability also makes IoT deployments more flexible to upgrades, as newer features can be added without significantly burdening the system.

Processing units interfaced with IoT objects require an optimal balance between power and performance [4]. Since many IoT objects are battery powered, it is desirable that these objects operate for their entire lifetime with the battery they are deployed with (e.g., medical sensors implanted into a patient’s body via an invasive surgical process). Although great progress has been made in battery technology, batteries are still not able to keep pace with the demands of modern electronics [5]. So, power optimization must be considered in parallel with performance optimization.

[Figure: a host processor (high performance optimized) connected through an interconnect to three interface processors (low power optimized), each attached to sensing elements and actuation control; the diagram is labeled Tier 1 and Tier 2.]
Fig. 1. Two-tiered heterogeneous processor architecture model for IoT

For incorporating higher levels of power optimized performance in IoT deployments, a two-tiered heterogeneous processor architecture is suitable [3] [6]. This two-tiered architecture, shown in Figure 1, consists of a host processor, optimized for high performance, interfaced with a number of interface processors, optimized for low power operation. The interface processors collect data from data-sensing elements and control actuating elements. These processors are always operated in active mode because their low power operation does not severely impact battery life. Higher-end functions, such as filtering and analysis of data and implementation of complex security protocols, are performed by the host processor. Since these operations are infrequent, the power hungry host processor is mostly operated in sleep state and only activated intermittently for limited durations.

Designing efficient embedded processors with power-optimized performance, for use in IoT objects, is a tedious process. Preventing high performance processors from violating the power budget requirements dictated by the market is an enormous design challenge [7]. The opportunities for optimizing a processor design for power are the greatest at the architecture level [7]. Thus, power and performance optimizations should be performed while defining the microarchitecture configuration of processors. The microarchitecture configuration consists of several processor design parameters, each of which has to be tuned based on the impact it has on the overall power and performance of the processor. Selecting a microarchitecture configuration involves rigorous design space exploration over a search space consisting of all possible settings for tunable processor design parameters.
There are two main challenges that need to be addressed in this process. Firstly, the design space exploration methodology, employed to select microarchitecture configurations of processors for IoT objects, must be temporally efficient. Long processor design time leads to long time to market, which results in lowered profits [8] [9] and shorter product life cycle [9]. The IoT market also lacks accepted industry standards, so those who get to the market first have the greatest opportunity to influence those standards [9]. Secondly, the design space exploration methodology must balance processor power consumption with performance, which are conflicting design metrics [10]. It is not possible to have a single optimal solution for optimization problems with conflicting design metrics. The optimization problem should instead be modeled as an Optimal Production Frontier problem, also known as a Pareto Efficiency [11] problem. Multiple solutions are obtained for such problems, where each solution favors one of the conflicting metrics. The design space exploration methodology must intelligently choose the best trade-off solution based on application specific requirements.

In this paper, we propose a temporally efficient design space exploration methodology for determining power and performance optimized microarchitecture configurations of embedded processors used in IoT objects. We use a combination of exhaustive, greedy and one-shot search methods to perform design space exploration. We verify the effectiveness of our methodology by testing it on a cycle-accurate simulator using a large set of standard benchmarks with varying workloads. The main contributions of our paper are:
• We propose a temporally efficient design space exploration methodology to find microarchitecture configurations for low-power and high-performance optimized embedded processors used in IoT objects.
• We include a threshold parameter in the design space exploration methodology which can be manipulated by the system designer to control design time based on time to market constraints.
• We propose exhaustive, greedy and one-shot search algorithms which yield microarchitecture configurations that are within 2.23%–3.69% of the microarchitecture configurations obtained from fully exhaustive search.
• We distinguish between different microarchitecture configurations based on the size and type of benchmark used, and relate them to potential use cases in IoT.

The remainder of the paper is organized as follows. In Section II, we present a review of related work. We describe our design space exploration methodology in Section III and elaborate on its different phases in Section IV. In Section V, we describe the cycle-accurate simulator and benchmarks used to test our methodology. We discuss the results in Section VI and present our conclusions and future research directions in Section VII.

II. RELATED WORK

Several SoC design companies have released articles on techniques for increasing processing capabilities in IoT objects. Some articles guide the selection of processors for IoT objects while others describe low power optimized processor architectures for IoT deployments. ARM proposed a processor architecture consisting of multiple homogeneous processors in a single IoT object, each serving a different purpose [6]. They defined a system with three Cortex-M processors: one to handle network connectivity, one to manage the interface with sensors and actuators, and one as a host processor controlling the other two. They stated that multiple processors are better for lowering power consumption in IoT objects since only the processor serving the current task would be in active mode while the rest would be in sleep mode.
ARM also proposed a guide to selecting microcontrollers for IoT objects [12]. In this guide, they argued that high-end microcontrollers were suitable for IoT deployments for two reasons. Firstly, high-end microcontrollers complete processing tasks sooner and can enter sleep mode to conserve power. Secondly, the larger flash and RAM sizes available with high-end microcontrollers facilitate implementation of complex networking protocols without addition of any new processors in the system. These articles clearly demonstrate the need for having more power-optimized performance in IoT deployments.

Synopsys also proposed the use of multiple processors in IoT deployments [13]. They described the use of a two-tiered processor architecture in IoT objects: ultra low power embedded processors used to interface with sensing elements to collect, filter and process data, and a host processor used to manage the embedded processors. Their processor architecture lowered power consumption by keeping the power hungry host processor mostly in sleep mode, similar to the concept used by ARM. Synopsys also discussed optimization of processors using configurable hardware extensions for sensor applications [13]. They stated that adding custom hardware extensions for executing typical sensor functions reduces the processor cycle count required to execute sensor applications. The reduction in cycle count lowers energy consumption either by lowering the clock frequency while keeping the same execution time, or by keeping the same power with a shorter execution time.

Apart from research carried out by SoC design companies, processor design has also been extensively studied in academia [14] [15]. There are many research works in the literature involving optimized processor design.
Most works employ design space exploration [16] [17] techniques utilizing search methods like exhaustive and greedy search and optimizing algorithms like genetic and evolutionary algorithms. Givargis et al. [18] developed an exploration methodology named PLATUNE (PLATform TUNEr) that carried out exhaustive searches in two stages: first, over clusters of strongly interconnected parameters to obtain Pareto-optimal configurations local to each cluster, and second, over all the clusters to obtain a global Pareto-optimal solution. The approach could explore design spaces as large as $10^{14}$ configurations, but it took on the order of 1-3 days to complete. Palesi et al. [19] argued that the high exploration time for PLATUNE was due to the formation of large partial search spaces in the clustering process. Palesi et al. improved the PLATUNE exploration methodology by introducing a new threshold value that distinguished between clusters based on the size of their partial search-space. An exhaustive search method was used for clusters with partial search-spaces smaller than the threshold value and a genetic exploration algorithm was used for larger spaces. Through this improvement, they were able to achieve an 80% reduction in simulation time while still remaining within 1% of the results obtained from exhaustive search. Genetic algorithms were also used in the MULTICUBE system by Silvano et al. [20]. The MULTICUBE system defined an automatic design space exploration algorithm that could quickly determine an approximate Pareto front for given design requirements.

Munir et al. [21] proposed another alternative to overcome the overhead of exhaustive search in their work on dynamic optimization of wireless sensor networks. Their approach was divided into two phases.
In the first phase, a one-shot search algorithm selected initial parameter settings and further ordered the parameters based on their significance towards the application requirements. In the second phase, a greedy algorithm was used to search the design space. Their approach yielded a design configuration that was within 8% of the optimal configuration while only exploring 1% of the design space.

In this paper, we improve on the work carried out by Munir et al. [21]. We leverage a similar approach to design space exploration but add two new phases: a set-partitioning phase and an exhaustive search phase. The addition of the exhaustive search phase aims at increasing the degree of closeness to the optimal solution by exploring a larger portion of the design space, as argued by Silvano et al. [20]. The limit on the number of configurations considered in the exhaustive search is determined by the set-partitioning phase, which uses a threshold value [19].

III. METHODOLOGY

Our design space exploration methodology for determining the optimal microarchitecture configuration of embedded processors for IoT is shown in Figure 2. Our methodology is implemented in four phases: initial one-shot search configuration tuning and parameter significance, set-partitioning, exhaustive search configuration tuning, and greedy search configuration tuning.

The initial one-shot search configuration tuning and parameter significance phase is carried out by the initial one-shot search configuration tuning module and the parameter significance ordering module.
The microarchitecture configuration parameter settings set, which consists of all the possible settings for each tunable microarchitecture parameter, is provided as input to the initial one-shot search configuration tuning module by the system designer. This module uses the parameter settings set to generate initial test configurations. Each initial configuration is passed to a cycle-accurate simulator. The test benchmarks for evaluating the microarchitecture configurations are provided as input to the simulator by the system designer. The simulator executes each initial test configuration separately for each test benchmark specified. The test benchmarks provide varying workloads for testing the initial test configurations. The system designer also provides the weights for balancing design metrics as input to the simulator. These weights are used to specify the preferred tradeoff between conflicting design metrics. The simulator module evaluates the initial test configurations supplied by the initial one-shot search configuration tuning module to determine the best initial setting for each tunable microarchitecture parameter.

[Figure: flow diagram. Inputs: microarchitecture configuration parameter settings set, test benchmarks, weights for design metrics, exploration threshold. Blocks: initial one-shot configuration tuning module, cycle-accurate simulator, parameter significance ordering module, significance ordered parameter set, set partitioning, separated exhaustive search set, separated greedy search set, exhaustive search configuration tuning module, best settings of parameters in exhaustive search set, greedy search configuration tuning module. Output: optimized microarchitecture configurations for processor for IoT.]
Fig. 2. Design space exploration methodology for determining optimal microarchitecture configuration of embedded processor for IoT
The simulation results are forwarded to the parameter significance ordering module, where the tunable microarchitecture parameters are ordered based on their significance to the design metrics considered. The ordered set of significance values is communicated to the set-partitioning module, which separates the parameters into two search sets: exhaustive and greedy. The parameters are separated based on an exploration threshold value provided by the system designer. The exploration threshold value is used to control the search space for the exhaustive search phase of our design space exploration methodology. The exhaustive search phase is the longest phase in the design space exploration methodology, and processor design time can be significantly altered by varying this exploration threshold value.

The microarchitecture parameters separated out in the exhaustive search set are communicated to the exhaustive search configuration tuning module. This module generates test configurations using all possible combinations of settings for the tunable processor design parameters in this set. The parameters which are not in the exhaustive search set retain their best settings from the initial one-shot search configuration tuning process. These test configurations are evaluated on the cycle-accurate simulator to determine the test configuration possessing the best tradeoff between the conflicting design metrics considered. The best settings for the microarchitecture parameters in the exhaustive search set are then communicated to the greedy search configuration tuning module.

The greedy search configuration tuning module generates test configurations using the processor design parameters separated out in the greedy search set. A greedy search algorithm (refer to Section IV-D) is used to generate these test configurations.
The microarchitecture parameters in the exhaustive search set retain the best settings obtained from the exhaustive search simulation process. The parameters which are in neither of the two search sets retain their best settings from the initial one-shot search configuration tuning process. The best configuration obtained at the end of the greedy search configuration tuning process is communicated back to the processor designer as the optimal microarchitecture of the processor with the preferred tradeoff between the conflicting design metrics.

A. Defining the Design Space

Consider that $n$ tunable parameters are available to describe the microarchitecture configuration of an embedded processor for IoT. Let $P$ be the list of these tunable parameters, defined as the following set:

$$P = \{P_1, P_2, P_3, \cdots, P_n\} \tag{1}$$

Each tunable parameter $P_i$ [where $i \in \{1, 2, \cdots, n\}$] in the list $P$ is the set of possible settings for the $i$-th parameter. Let $L$ be the set containing the size of the set of possible settings for each parameter in list $P$.

$$L = \{L_1, L_2, L_3, \cdots, L_n\} \tag{2}$$

such that,

$$L_i = |P_i| \quad \forall i \in \{1, 2, \cdots, n\} \tag{3}$$

where $|P_i|$ is the cardinality of set $P_i$. So, each parameter setting set $P_i$ in the list $P$ is defined as follows:

$$P_i = \{P_{i1}, P_{i2}, P_{i3}, \cdots, P_{iL_i}\} \quad \forall i \in \{1, 2, \cdots, n\} \tag{4}$$

The values in the set $P_i$ are arranged in ascending order. The state space for design space exploration is the collection of all the possible configurations that can be obtained using the $n$ parameters.

$$S = P_1 \times P_2 \times P_3 \times \cdots \times P_n \tag{5}$$

Here, $\times$ represents the Cartesian product of the lists in $P$. Throughout this paper, we use the term $S$ to denote the state space composed of all $n$ tunable parameters. To maintain generality, when referring to a state space composed of $a$ tunable parameters where $a < n$, we attach a subscript to the term $S$.
$$S_a = P_1 \times P_2 \times P_3 \times \cdots \times P_a \quad \forall a < n \tag{6}$$

We note that the state space of $a$ tunable parameters does not constitute a complete design configuration and is only used as an intermediate when defining our methodology. We also reserve the use of the $\times$ operator in the following manner:

$$S_a = S_a \times P_i \quad \forall i \in \{1, 2, \cdots, n\} \tag{7}$$

This represents the extension of the state space $S_a$ to include one new set of parameter settings $P_i$ from the list $P$. This operation, read as an update of $S_a$, increases the number of tunable parameters in the state space $S_a$ by one.

When referring to a design configuration that belongs to the state space $S$, we use the term $s$. We attach subscripts to $s$ to refer to specific design configurations. For example, a state $s_f$ that consists of the first setting of each tunable parameter can be written as:

$$s_f = (P_{11}, P_{21}, P_{31}, \cdots, P_{n1}) \tag{8}$$

Similarly, to denote an incomplete/partial design configuration of $a$ tunable parameters we use the term $\delta s_a$.

B. Benchmarks

Each of the configurations selected from the state space $S$ by our methodology is tested on $m$ test benchmarks. The design metrics for each simulated configuration are collected separately for each benchmark.

C. Objective Function

In our methodology, design configurations are compared with each other based on their objective functions. The objective function of a design configuration is the weighted sum of the normalized design metrics obtained after simulating that design configuration. Let $o$ be the number of design metrics and $V$ be the set of normalized values of design metrics which are obtained from the simulation.

$$V_s^k = \{V_{s1}^k, V_{s2}^k, V_{s3}^k, \cdots, V_{so}^k\} \quad \forall k = 1, 2, \cdots, m \tag{9}$$

Let $w$ be the set of weights for the design metrics based on the requirements of the targeted application. These weights are set by the system designer.
$$w = \{w_1, w_2, w_3, \cdots, w_o\} \tag{10}$$

such that,

$$0 \le w_l \le 1 \quad \forall l = 1, 2, \cdots, o \tag{11}$$

and,

$$\sum_{l=1}^{o} w_l = 1 \tag{12}$$

The objective function $F$ of a design configuration $s$ for a test benchmark $k$ is defined as follows:

$$F_s^k = \sum_{l=1}^{o} w_l V_{sl}^k \tag{13}$$

The optimization problem considered in this paper is to minimize the value of the objective function $F$. The design metrics are chosen such that the minimization of their values is the favorable design choice. For example, when considering the performance metric, the design goal is to maximize performance. To model this in the objective function, we use execution time to measure performance. Minimizing execution time fits with minimizing the objective function while still modeling the design goal of maximizing performance. The optimization problem for each test benchmark $k$ is defined as follows:

$$\min F_s^k \quad \text{s.t.} \quad s \in S \tag{14}$$

Table I presents the symbols established in this section in list form.

TABLE I
LIST OF SYMBOLS

Symbol            Description
$n$               Number of tunable microarchitecture parameters
$P$               List of tunable microarchitecture parameters
$P_i$             Set of possible settings for the $i$-th tunable microarchitecture parameter
$L$               Size of set of possible settings for each tunable microarchitecture parameter
$L_i$             Cardinality of set $P_i$
$S$               State space for design space exploration
$S_a$             Partial/incomplete state space
$s_{tag}$         State in state space $S$ with ‘tag’ identifier
$\delta s_a$      State in partial state space $S_a$
$m$               Number of test benchmarks
$o$               Number of design metrics
$V_s^k$           Set of normalized values obtained for design metrics from simulation of state $s$ for the $k$-th benchmark
$w$               Set of weights for design metrics
$w_l$             Weight for the $l$-th design metric
$F_s^k$           Objective function obtained from simulating state $s$ for the $k$-th benchmark

IV. PHASES OF METHODOLOGY

Our proposed design space exploration methodology consists of four distinct phases. In this section, we elaborate on the steps involved in each phase using the notation set up in Section III.

A. Phase I: Initial One-Shot Search Configuration Tuning and Parameter Significance

In this phase of our methodology, the best initial setting for each tunable microarchitecture parameter in set $P$ is determined by using a one-shot search configuration tuning process. The one-shot search process is based on single factor analysis, which is an effective heuristic approach used in design space exploration [22]. Unlike single factor analysis, wherein parameters can have only two settings, a zero value and a non-zero value setting, one-shot search works on parameters with more than two non-zero value settings. In the one-shot search process, parameters are evaluated one by one. Two test configurations are generated for each parameter, one with the first setting and one with the last setting from the list of settings for the current parameter. The remaining parameters are arbitrarily set to the first setting from their corresponding lists of settings.
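The one-shot procedure described above, together with the weighted-sum objective of Section III-C, can be sketched in Python for a single benchmark. This is a minimal illustration under stated assumptions, not the authors' implementation: `simulate` is a hypothetical stand-in for the cycle-accurate simulator, `toy_simulate` and its numbers are invented for the example, and the sign rule for choosing the best setting follows Algorithm 1.

```python
from typing import Callable, List, Sequence, Tuple

Config = Tuple[float, ...]  # one chosen setting per tunable parameter


def objective(metrics: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of normalized design metrics (Eq. 13); lower is better."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1 (Eq. 12)")
    return sum(w * v for w, v in zip(weights, metrics))


def one_shot_search(
    params: List[List[float]],
    simulate: Callable[[Config], Sequence[float]],
    weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (best settings B, significance D) for one benchmark.

    params[i] is the ascending list of settings of parameter i.
    simulate is a hypothetical simulator hook returning normalized metrics.
    """
    best, significance = [], []
    for i, settings in enumerate(params):
        # s_f: first setting of every parameter.
        s_f = tuple(p[0] for p in params)
        # s_l: last setting of parameter i, first setting of all others.
        s_l = tuple(settings[-1] if j == i else p[0] for j, p in enumerate(params))
        d = objective(simulate(s_l), weights) - objective(simulate(s_f), weights)
        significance.append(d)
        # Positive difference: the first setting gave the lower objective.
        best.append(settings[0] if d > 0 else settings[-1])
    return best, significance


def toy_simulate(config: Config) -> Tuple[float, float]:
    """Invented metrics: (execution time, power), both already normalized."""
    return (1.0 / config[0], config[1] / 32.0)


best, sig = one_shot_search([[1, 2, 4], [8, 16, 32]], toy_simulate, (0.5, 0.5))
print(best, sig)  # best == [4, 8], sig == [-0.375, 0.375]
```

In this toy setup the first parameter only reduces execution time, so its last setting is kept, while the second only adds power, so its first setting is kept; the magnitude of each entry in `sig` is what the later set-partitioning phase sorts on.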
Algorithm 1: Initial One-Shot Search Configuration Tuning and Parameter Significance
Input: $P$ - List of Tunable Parameters
Output: $B$ - Set of Best Settings; $D$ - Significance of Parameters with respect to Objective Function
 1  for $i \leftarrow 1$ to $n$ do
 2      $s_f = \{P_{i1}\}$
 3      $s_l = \{P_{iL[i]}\}$
 4      for $j \leftarrow 1$ to $n$ do
 5          if $i \neq j$ then
 6              $s_f = s_f \cup \{P_{j1}\}$
 7              $s_l = s_l \cup \{P_{j1}\}$
 8          end
 9      end
10      for $k \leftarrow 1$ to $m$ do
11          Explore $k$-th benchmark using configuration $s_f$
12          Calculate $F^k_{s_f}$
13          Explore $k$-th benchmark using configuration $s_l$
14          Calculate $F^k_{s_l}$
15          $D^k_i = F^k_{s_l} - F^k_{s_f}$
16          if $D^k_i > 0$ then
17              $B^k_i = P_{i1}$
18          else
19              $B^k_i = P_{iL[i]}$
20          end
21      end
22  end

The steps involved in initial one-shot search configuration tuning and determining parameter significance are detailed in Algorithm 1. The first and last test configurations generated for evaluating a tunable microarchitecture parameter, $P_i$ in set $P$, are denoted by $s_f$ and $s_l$, respectively. These configurations are tested on the cycle-accurate simulator. From the results of the simulation, the objective functions $F_{s_f}$ and $F_{s_l}$, corresponding to $s_f$ and $s_l$, respectively, are determined. The objective function values are used to determine the best initial setting as well as the significance of each microarchitecture parameter. The magnitude of the difference between $F_{s_f}$ and $F_{s_l}$, which is stored in the parameter significance set $D$ (line 15), is used as the parameter significance. The higher the magnitude of a difference $D^k_i$, $i \in \{1, 2, 3, \ldots, n\}$, for a benchmark $k$, $k \in \{1, 2, 3, \ldots, m\}$, the higher is the significance of parameter $P_i$ to the workload characterized by benchmark $k$. The sign of the difference between $F_{s_f}$ and $F_{s_l}$ is used to pick the best initial setting for parameter $P_i$. If the difference is positive, then the first setting of parameter $P_i$ is chosen as the best setting, otherwise the last setting is chosen.
The best settings for the parameters are stored in the set of best settings $B^k_i$ (lines 17 and 19).

B. Phase II: Set-Partitioning

Algorithm 2: Set-Partitioning
Input: $D$ - Significance of Parameters towards Objective Function; $I$ - Index Set; $T$ - Exhaustive Search Threshold Factor
Output: $E$ - Set of Parameters for Exhaustive Search; $G$ - Set of Parameters for Greedy Search
 1  $E = \emptyset$ and $G = \emptyset$
 2  for $k \leftarrow 1$ to $m$ do
 3      sortDescending($|D^k|$) - s.t. index information of the sorted values is preserved in $I^k$
 4      sort($P^k$) and sort($L^k$) w.r.t. index information in $I^k$
 5      $numE = 1$ and $i = 1$
 6      while $numE \le T$ do
 7          $numE = numE \times L^k_i$
 8          if $numE \le T$ then
 9              $E^k = E^k \cup \{P^k_i\}$
10              $i = i + 1$
11          else
12              break
13          end
14      end
15      $numG$ = ceil$((|P^k| - |E^k|)/2)$
16      while $numG > 0$ do
17          $G^k = G^k \cup \{P^k_i\}$
18          $numG = numG - 1$
19          $i = i + 1$
20      end
21  end

The set-partitioning phase, presented in Algorithm 2, shows how the parameter significance values determined in the first phase of our methodology are used to separate the list of tunable microarchitecture parameters into exhaustive and greedy search sets. First, the parameter significance set $|D^k|$ for each benchmark $k$, $k \in \{1, 2, 3, \ldots, m\}$, is sorted in descending order of magnitude using the sortDescending($|D^k|$) function. The index information of the sorted values is preserved in a set of indexes $I^k$ (line 3). For example, if the fifth entry $D^k_5$ has the greatest value, $D^k_5$ will become the first entry in the set $D^k$ and the first entry in the set of indexes $I^k$ will be 5, that is, $I^k_1 = 5$. The set of indexes, $I^k$, is used to sort the list of tunable microarchitecture parameters, $P^k$, and the list of set sizes, $L^k$. After sorting, the parameters with higher significance lie towards the start of the set and the parameters with lower significance lie towards the end of the set.
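Algorithm 2 can be sketched in Python for a single benchmark as follows. This is an illustrative reading rather than the authors' code: the function name `partition_parameters` and the example numbers are hypothetical, parameters are referred to by index instead of by their settings lists, and a bounds check is added (an assumption beyond the listing) so the loop stops when every parameter fits under the threshold.

```python
import math
from typing import List, Sequence, Tuple


def partition_parameters(
    significance: Sequence[float],
    sizes: Sequence[int],
    threshold: int,
) -> Tuple[List[int], List[int], List[int]]:
    """Split parameter indices into (exhaustive, greedy, one-shot) sets.

    significance[i] is D_i, sizes[i] is L_i and threshold is T.
    """
    # Sort indices by descending |D_i|; this plays the role of the index set I.
    order = sorted(range(len(significance)),
                   key=lambda i: abs(significance[i]), reverse=True)

    exhaustive: List[int] = []
    space = 1  # size of the partial state space of the exhaustive set (numE)
    pos = 0
    while pos < len(order):  # bounds check added for this sketch
        grown = space * sizes[order[pos]]
        if grown > threshold:
            break  # adding this parameter would exceed T
        space = grown
        exhaustive.append(order[pos])
        pos += 1

    # Upper half of the remaining parameters is searched greedily,
    # the lower half keeps its one-shot settings.
    num_greedy = math.ceil((len(order) - len(exhaustive)) / 2)
    greedy = order[pos:pos + num_greedy]
    one_shot = order[pos + num_greedy:]
    return exhaustive, greedy, one_shot


e, g, o = partition_parameters([0.1, -0.9, 0.4, 0.05, -0.3], [4, 3, 5, 2, 6], 20)
print(e, g, o)  # [1, 2] [4, 0] [3]
```

With a threshold of 20, the two most significant parameters (3 and 5 settings, 15 combinations) fit in the exhaustive set, while adding the third (6 settings) would give 90 combinations and is therefore rejected; of the three remaining parameters, two go to the greedy set and one stays at its one-shot setting.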
The list of parameters is then divided into three subsets: exhaustive search, greedy search and one-shot search sets. The exhaustive search set gets the parameters with the highest significance. The number of parameters separated into the exhaustive search set depends on the exploration threshold value, $T$, provided by the system designer. The threshold value $T$ limits the size of the partial search space of the exhaustive search set, $num_E$ (line 6). After separating out the exhaustive search set, the parameters remaining in the parameter list are separated into greedy search and one-shot search sets. The list of remaining parameters is divided into two halves (line 15): the upper half, $\mathrm{ceil}((|P^k| - |E^k|)/2)$, is separated as the greedy search set and the lower half is separated as the one-shot search set. We observe empirically that dividing the list of remaining parameters into halves provides efficient design space exploration without significantly compromising the solution quality. The parameters separated as the one-shot search set are not explored further and are left at the best settings determined for them in Algorithm 1.

C. Phase III: Exhaustive Search Configuration Tuning

Algorithm 3: Exhaustive Search
Input: $P$ - List of Tunable Parameters; $B$ - Set of Best Settings for One-shot Search; $E$ - List of Parameters for Exhaustive Search
Output: $B$ - List of Best Settings for One-shot and Exhaustive Search
 1  $s_E = \emptyset$
 2  $\delta s_E = \emptyset$ and $\delta s_{E'} = \emptyset$
 3  for $k \leftarrow 1$ to $m$ do
 4      $F^k_{s_b} = \infty$
 5      for $i \leftarrow 1$ to $n$ do
 6          if $P_i \notin E^k$ then
 7              $\delta s^k_{E'} = \delta s^k_{E'} \cup \{B^k_i\}$
 8          end
 9      end
10      for $i \leftarrow 1$ to $n$ do
11          if $P_i \in E^k$ then
12              $S^k_E = S^k_E \times P_i$
13          end
14      end
15      for $j \leftarrow 1$ to $|S^k_E|$ do
16          $\delta s^k_{E_j}$ is a partial configuration in state space $S^k_E$
17          $s^k_E = \delta s^k_{E_j} \cup \delta s^k_{E'}$
18          Explore $k$-th benchmark using configuration $s^k_E$
19          Calculate $F^k_{s_E}$
20          if $F^k_{s_E} < F^k_{s_b}$ then
21              $F^k_{s_b} = F^k_{s_E}$
22              $B^k = s^k_E$
23          end
24      end
25  end

Algorithm 3 details the steps involved in the exhaustive search process. The exhaustive search process determines the best settings for the parameters in the exhaustive search set $E$. First, the settings for the parameters that are not in the exhaustive search set $E$ are assigned (line 7). These parameters are assigned their best settings from the set of best settings $B^k_i$ as determined in the initial one-shot search configuration tuning process described in Algorithm 1. These settings make up the partial test design configuration $\delta s_{E'}$. Next, a partial state space $S_E$ is formed for the parameters in the exhaustive search set $E$ (line 12). Every possible partial test design configuration, $\delta s_{E_j}$ (line 16), in the partial state space $S_E$, is combined with the partial test design configuration $\delta s_{E'}$ to form complete simulatable test design configurations. Each complete test design configuration is evaluated on the simulator. An objective function value, $F_{s_E}$, is obtained for each complete test design configuration, $s_E$, from the simulator.
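The construction of the partial state space and its combination with the fixed one-shot settings can be sketched in Python as follows. This is a minimal sketch for one benchmark, assuming a hypothetical `objective` callback in place of a simulator run; the function name `exhaustive_search` and the data layout are illustrative and not taken from the paper. The Cartesian product of line 12 is expressed with `itertools.product`.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def exhaustive_search(
    params: Dict[str, List[float]],
    exhaustive: List[str],
    best: Config,
    objective: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Phase III for one benchmark: try every combination of the exhaustive-set parameters."""
    # Partial configuration for parameters outside E, fixed at their one-shot best settings.
    fixed = {name: value for name, value in best.items() if name not in exhaustive}

    best_cfg, best_f = dict(best), float("inf")
    # Partial state space S_E: Cartesian product of the settings of parameters in E.
    for combo in product(*(params[name] for name in exhaustive)):
        cfg = {**fixed, **dict(zip(exhaustive, combo))}
        f = objective(cfg)  # hypothetical stand-in for a simulation run
        if f < best_f:      # keep the smallest objective value seen so far
            best_f, best_cfg = f, cfg
    return best_cfg, best_f
```

The number of evaluations equals the product of the setting counts of the parameters in the exhaustive set, which is the quantity the threshold $T$ bounds in Algorithm 2.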
The algorithm keeps track of the smallest objective function value encountered in the search process in $F_{s_b}$, which represents the best objective function value. When a design configuration results in an objective function that has a value less than $F_{s_b}$ (line 20), then $F_{s_b}$ is changed to the new minimum value and the set of best settings $B$ is updated with the corresponding design configuration.

D. Phase IV: Greedy Search Configuration Tuning

In the final phase of our methodology, described in Algorithm 4, the best settings for the parameters in the greedy search set $G$ are determined. For each parameter in the set $G$, the sign of the parameter significance is checked to determine whether the first setting or the last setting was chosen as the best setting in the first phase of our methodology. If the sign of the parameter significance is positive, then it indicates that the first setting for that parameter yields a smaller objective function as compared to the last. If the sign is negative, then it indicates that the last setting for that parameter yields a smaller objective function as compared to the first. We assume that the setting that yields the smallest objective function overall lies closer to whichever of the two settings (first or last) yielded the smaller objective function in the initial one-shot search configuration tuning process. To ensure that the search process starts from the setting that yielded the smallest objective function in the initial one-shot search configuration tuning process, the set of parameter settings $P_i$ is either sorted in descending order (when the last setting is the best setting) or left unchanged in its default ascending order (when the first setting is the best setting) (line 8). In the greedy search process, the parameters in the greedy search set are considered one at a time.
First, a partial test design configuration $\delta s_{G'_P}$ is formed using the exhaustive search set, the one-shot search set and the non-current parameters in the greedy search set. The parameters in the exhaustive search set, $E$, are assigned their best values as determined in the exhaustive search configuration tuning process. The parameters in the one-shot search set retain the best settings determined in the initial one-shot configuration tuning process. The non-current parameters in the greedy search set, $G$, are assigned best settings in one of two ways. If the non-current parameter has already been processed by the greedy search optimization process, then the parameter is assigned the best setting obtained from that process. If the non-current parameter has not been processed yet, then the parameter is assigned the best setting obtained from the initial one-shot search configuration tuning process.

Algorithm 4: Greedy Search
Input: $P$ - List of Tunable Parameters; $D$ - Significance of Parameters towards Objective Function; $B$ - Set of Best Settings for One-shot and Exhaustive Search; $E$ - Set of Parameters for Exhaustive Search; $G$ - Set of Parameters for Greedy Search
Output: $B$ - Complete set of Best Settings
 1  $s_G = \emptyset$
 2  $\delta s_{G'} = \emptyset$
 3  $G_P = \emptyset$
 4  for $k \leftarrow 1$ to $m$ do
 5      $F^k_{s_b} = \infty$
 6      for $i \leftarrow 1$ to $n$ do
 7          if $P_i \in G^k$ then
 8              if $D^k_i < 0$ then
 9                  $G_P$ = sortDescending($P_i$)
10              end
11              for $j \leftarrow 1$ to $n$ do
12                  if $P_j \neq G_P$ then
13                      $\delta s^k_{G'_P} = \delta s^k_{G'_P} \cup \{B^k_j\}$
14                  end
15              end
16              for $l \leftarrow 1$ to $L_i$ do
17                  $s^k_G = \delta s^k_{G'_P} \cup \{G_{P_l}\}$
18                  Explore $k$-th benchmark using configuration $s^k_G$
19                  Calculate $F^k_{s_G}$
20                  if $F^k_{s_G} < F^k_{s_b}$ then
21                      $F^k_{s_b} = F^k_{s_G}$
22                      $B^k_i = G_{P_l}$
23                  else
24                      break
25                  end
26              end
27          end
28      end
29  end

The partial test design configuration $\delta s_{G'_P}$ is then combined with the settings for the current parameter being processed to form the complete simulatable test design configuration $s_G$ (line 17). This configuration is evaluated on the cycle-accurate simulator. The resulting objective function, $F_{s_G}$, is compared with the best objective function $F_{s_b}$, which holds the smallest objective function value encountered thus far in the search process. Similar to the exhaustive search process, when a design configuration results in an objective function that has a value less than $F_{s_b}$ (line 20), then $F_{s_b}$ is changed to the new minimum value and the set of best settings $B^k_i$ is updated with the corresponding design configuration. However, when the search process encounters a design configuration that results in an objective function that has a value greater than $F_{s_b}$, then the search process for the current parameter is terminated and the next parameter in the parameter list $G$ is explored.

V. EXPERIMENTAL SETUP

We used the ESESC [23] (Enhanced Super ESCalar) simulator to simulate all the test microarchitecture configurations generated by our methodology. The ESESC simulator is a fast cycle-accurate chip multiprocessor simulator. It models an out-of-order RISC (Reduced Instruction Set Computing) processor running the ARM instruction set. We used benchmarks from the PARSEC and SPLASH2 [24], [25] benchmark suite to test our methodology. The PARSEC and SPLASH2 benchmark suite is a collection of standardized benchmarks which provides a diverse range of workloads for evaluation of processors. We used the following benchmarks from the PARSEC and SPLASH2 suite to test our methodology.

PARSEC Benchmarks: Blackscholes, Canneal, Facesim, Fluidanimate, Freqmine, x264
SPLASH2 Benchmarks: Cholesky, FFT, LU_cb, LU_ncb, Ocean_cp, Ocean_ncp, Radiosity, Radix, Raytrace

The methodology phases were implemented using PERL [26]. The results from the simulation processes were collected in MS Excel using the Excel-Writer-XLSX [27] tool for PERL.
We tested our design space exploration methodology separately for low-power and high-performance processor design. We combined the microarchitecture configurations obtained from these tests to form a two-tiered heterogeneous processor architecture. The microarchitecture configurations obtained from the low-power processor design tests were used to implement the low-power optimized interface processors, the lower tier of the two-tiered architecture. The microarchitecture configurations obtained from the high-performance processor design tests were used to implement the high-performance optimized host processor, the upper tier of the two-tiered architecture.

TABLE II: MICROARCHITECTURE CONFIGURATION PARAMETER SETTINGS

| Parameter Name | Set of Settings (Low-Power) | Set of Settings (High-Performance) |
|---|---|---|
| Cores | 1, 2, 4 | 2, 4, 8 |
| Frequency (MHz) | 75, 100, 125, 150 | 1700, 2200, 2800, 3200 |
| L1-I Cache Size (kB) | 8, 16, 32, 64 | 8, 16, 32, 64, 128 |
| L1-D Cache Size (kB) | 8, 16, 32, 64 | 8, 16, 32, 64, 128 |
| L2 Cache Size (kB) | 256, 512, 1024 | 256, 512, 1024 |
| L3 Cache Size (kB) | 2048, 4096 | 2048, 4096, 8192 |

The microarchitecture parameters considered for testing our methodology, along with the set of possible settings for each parameter, are listed in Table II. We used different ranges of settings for low-power and high-performance processor design. The range of settings listed in Table II under low-power design was used for the design of low-power optimized interface processors. The design space cardinality for low-power processor design was 1,152 configurations. The range of settings listed in Table II under high-performance design was used for the design of the high-performance optimized host processor. The design space cardinality for high-performance processor design was 2,700 configurations.
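The two design space cardinalities follow directly from the number of settings per parameter in Table II: 3 × 4 × 4 × 4 × 3 × 2 = 1,152 for the low-power design and 3 × 4 × 5 × 5 × 3 × 3 = 2,700 for the high-performance design. The short Python check below reproduces this arithmetic; the dictionary keys and the function name `design_space_size` are illustrative names chosen for this example.

```python
from math import prod
from typing import Dict, List

# Settings copied from Table II; the key names are illustrative.
LOW_POWER: Dict[str, List[int]] = {
    "cores": [1, 2, 4],
    "frequency_mhz": [75, 100, 125, 150],
    "l1i_kb": [8, 16, 32, 64],
    "l1d_kb": [8, 16, 32, 64],
    "l2_kb": [256, 512, 1024],
    "l3_kb": [2048, 4096],
}
HIGH_PERFORMANCE: Dict[str, List[int]] = {
    "cores": [2, 4, 8],
    "frequency_mhz": [1700, 2200, 2800, 3200],
    "l1i_kb": [8, 16, 32, 64, 128],
    "l1d_kb": [8, 16, 32, 64, 128],
    "l2_kb": [256, 512, 1024],
    "l3_kb": [2048, 4096, 8192],
}


def design_space_size(settings: Dict[str, List[int]]) -> int:
    """Number of distinct configurations: the product of the per-parameter setting counts."""
    return prod(len(values) for values in settings.values())


print(design_space_size(LOW_POWER))         # 1152
print(design_space_size(HIGH_PERFORMANCE))  # 2700
```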
TABLE III: WEIGHTS FOR DESIGN METRICS

| Configuration | Power | Performance |
|---|---|---|
| Low-Power | 0.9 | 0.1 |
| High-Performance | 0.1 | 0.9 |

We used power and performance as design metrics to evaluate the microarchitecture configurations for both low-power and high-performance optimized processors. We used the normalized value of total dynamic power and leakage power [28] across all the cores in the processor as the power metric and the normalized value of total execution time as the performance metric. We used the weights presented in Table III to specify the preference for the conflicting design metrics of power and performance. The linear objective function used for the evaluation of the test microarchitecture configurations was:

$$F = w_P \cdot P + w_E \cdot E \tag{15}$$

where,

$$P = \text{Dynamic Power} + \text{Leakage Power}, \qquad E = \text{Total Execution Time} \tag{16}$$

VI. RESULTS

In this section, we present the results obtained while testing our methodology. This section is divided into two subsections. In the first subsection, we present results to validate our design space exploration methodology and in the second subsection, we discuss the applicability of some of the microarchitecture configurations to important IoT use cases.

A. Evaluation of design space exploration methodology

For evaluating our methodology, we compared our microarchitecture configuration results with those obtained from a fully exhaustive search of the design space. We tested our methodology with an exploration threshold of $T = 150$. This threshold value is an upper bound which limits the partial state space for the exhaustive search phase of our methodology.

1) Parameter significance: Figure 3 shows the normalized values of parameter significance for different PARSEC benchmarks.
The normalization is carried out using the maximum values for total power and total execution time obtained in the initial one-shot search configuration tuning process. The parameter significance values are calculated in the first phase of our methodology, initial one-shot search configuration tuning. We observe that the significance of each of the tunable processor design parameters varies based on the type of workload offered by the test benchmarks. For each of the test benchmarks, there are at most three significant processor design parameters. We note that the operating frequency is the processor design parameter with the highest significance for most of the test benchmarks, followed by core count, which is the second most significant design parameter. For certain test benchmarks, the sizes of the L1-I cache and L1-D cache are also highly significant to the overall design. The large significance of cache sizes is a result of large working sets with fine data-parallel granularity offered by those test benchmarks.

[Figure 3: parameter significance $|D|$ (0.0 to 1.0) of Cores, Frequency, L1-I Cache, L1-D Cache, L2 Cache and L3 Cache for the Blackscholes, Canneal, Facesim, Fluidanimate, Freqmine and x264 benchmarks.]
Fig. 3. Significance of microarchitecture configuration parameters for PARSEC benchmarks for high-performance optimized processor for IoT

[Figure 4: normalized Total Power and Execution Time axes (0.0 to 1.0) showing the x264 Pareto front, the objective function line and the point of intersection.]
Fig. 4. Linear objective function plotted with Pareto front for x264 (PARSEC) benchmark for high-performance optimized processor for IoT

2) Selecting a favorable tradeoff solution: Figure 4 shows the Pareto front obtained for the x264 (PARSEC) benchmark for the high-performance optimization requirement. The Pareto front is generated using the normalized values of the total power and execution time design metrics.
The front represents the conflicting interdependency between power and performance in a processor. It shows that increasing the performance of a processor degrades its power efficiency whereas increasing power efficiency degrades performance. It is thus impossible to determine a microarchitecture configuration which results in both these metrics having optimal values. The goal of the design space exploration methodology is to determine a balance between these conflicting design metrics. A suitable tradeoff between these metrics is selected by using the preference specified using the weights assigned to each metric. In our experiments, we specified $w_P$ and $w_E$ as the weights for power and performance metrics, respectively, to define a linear objective function (Equation 15).

TABLE IV: COMPARISON OF MICROARCHITECTURE CONFIGURATIONS OBTAINED FOR X264 (PARSEC) BENCHMARK FOR HIGH-PERFORMANCE OPTIMIZED PROCESSOR FOR IOT

| Parameter Name | Proposed Methodology | Fully Exhaustive Search |
|---|---|---|
| Cores | 2 | 2 |
| Frequency (MHz) | 3200 | 3200 |
| L1-I Cache Size (kB) | 64 | 128 |
| L1-D Cache Size (kB) | 64 | 128 |
| L2 Cache Size (kB) | 1024 | 256 |
| L3 Cache Size (kB) | 2048 | 8192 |
| Total Power (W) | 1.597 | 1.600 |
| Execution Time (ms) | 35.142 | 34.152 |

Figure 4 shows the objective function plotted along with the Pareto front. We note that the objective function forms a straight line in the power-performance graph with the slope $-w_P/w_E$. We observe that the objective function is tangent to the Pareto front at the power-performance value pair of the microarchitecture configuration obtained as solution by our methodology.
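The relationship between the weighted objective and the Pareto front can be illustrated with a small Python sketch. Setting $F$ in Equation 15 to a constant gives the line $E = (F - w_P \cdot P)/w_E$, whose slope is $-w_P/w_E$; minimizing $F$ slides this line towards the origin until it last touches the set of design points, and that contact point always lies on the Pareto front. The sample points below are invented for illustration and are not measurements from the paper, and the function names are assumptions of this sketch.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (normalized total power, normalized execution time)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point matches or beats in both metrics."""
    front = [
        p for p in points
        if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
    ]
    return sorted(front)


def best_tradeoff(points: List[Point], w_p: float, w_e: float) -> Point:
    """Point minimizing the linear objective F = w_p * P + w_e * E."""
    return min(points, key=lambda p: w_p * p[0] + w_e * p[1])


# Hypothetical normalized (power, time) pairs, for illustration only.
samples = [(0.2, 0.9), (0.4, 0.5), (0.6, 0.45), (0.9, 0.2), (0.8, 0.6), (0.5, 0.7)]
front = pareto_front(samples)
high_perf = best_tradeoff(samples, w_p=0.1, w_e=0.9)  # weights from Table III
low_power = best_tradeoff(samples, w_p=0.9, w_e=0.1)
```

With the high-performance weights the selected point is the one with the shortest execution time, and with the low-power weights it is the one with the lowest power; both are members of `front`. A linear objective can only select Pareto points that lie on the convex hull of the front, which is one reason the choice of weights matters.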
3) Comparison with fully exhaustive search: We verified the microarchitecture configuration obtained as solution from our methodology by comparing it against the solution obtained by running a fully exhaustive search of the design space. We present a comparison for the x264 (PARSEC) benchmark as an example in Table IV. The table shows a side-by-side comparison of the microarchitecture configuration obtained from our proposed methodology with that obtained from fully exhaustive search. Comparing these values, we see that significant parameters like operating frequency and core count match exactly while other parameters only differ slightly. The table also contains the values of the total power and execution time obtained for both configurations. Comparing the values of these design metrics, we see that the total power and execution time values obtained from our methodology are within -0.18% and 2.89%, respectively, of the total power and execution time values obtained from fully exhaustive exploration.

Using our methodology, on average we achieve microarchitecture configurations with total power values within 2.23% for low-power optimized processors and execution time within 3.69% for high-performance optimized processors as compared to fully exhaustive search. These configurations are obtained by exploring only 3%–5% of the processor design space, which results in our methodology having an average speedup of 24.16× as compared to fully exhaustive exploration of the design space.

B. Application scopes in IoT

Based on the type and size of workload offered by the test benchmarks, we separate them into four different categories, each of which relates to an IoT application or process. Table V shows the categorization of some of the key test benchmarks.
The Cholesky and Radix benchmarks from the SPLASH2 benchmark suite are categorized under data sensing and aggregation. The Cholesky benchmark is a sparse matrix factorization kernel and the Radix benchmark is an integer sort kernel [29]. The Cholesky benchmark is representative of data sensing in IoT applications, where data is acquired from multiple sensor sources and transformed into a more useful format. The Radix benchmark is representative of data aggregation, where indexing, sorting and storing operations are carried out on sensed data. These benchmarks are useful in determining the microarchitecture configurations of low-power optimized interface processors for the two-tiered heterogeneous processor architecture.

TABLE V: CATEGORIZATION OF TEST BENCHMARKS ACCORDING TO IOT APPLICATION

| IoT Application | Benchmarks |
|---|---|
| Data sensing and aggregation | Cholesky, Radix |
| Data analysis and Data mining | Blackscholes, Freqmine |
| Graphics | Facesim, Fluidanimate |
| Signal processing and Communication | FFT |

TABLE VI: MICROARCHITECTURE CONFIGURATIONS FOR LOW-POWER OPTIMIZED PROCESSORS FOR IOT

| Parameter Name | Cholesky | Radix |
|---|---|---|
| Cores | 1 | 1 |
| Frequency (MHz) | 75 | 75 |
| L1-I Cache Size (kB) | 8 | 8 |
| L1-D Cache Size (kB) | 32 | 64 |
| L2 Cache Size (kB) | 256 | 256 |
| L3 Cache Size (kB) | 2048 | 4096 |
| Total Power (W) | 0.0934 | 0.0935 |
| Execution Time (ms) | 327.958 | 332.535 |

The remaining categories all model more complex applications requiring a high level of processing capabilities. The Blackscholes and Freqmine benchmarks from the PARSEC benchmark suite are listed under data analysis and data mining. The Blackscholes benchmark is a financial analysis benchmark that analytically solves large sets of partial differential equations [24].
The Freqmine benchmark is a data mining kernel which implements Frequent Itemset Mining [24]. These benchmarks are representative of data analysis and filtering operations that need to be carried out on large volumes of sensor data in an IoT network. The Facesim and Fluidanimate benchmarks from the PARSEC benchmark suite are listed under graphics. The Facesim benchmark generates a visually realistic model of a human face and the Fluidanimate benchmark simulates an incompressible fluid for interactive animation purposes [24]. Graphical applications are important in IoT objects which need to interact with users via graphical user interfaces. The FFT benchmark from the SPLASH2 benchmark suite is listed under signal processing and communication. The FFT benchmark is an implementation of the Fast Fourier Transform algorithm which is optimized to minimize interprocess communication [29]. Signal processing and communication is one of the most common applications in an IoT network. FFT is an important Digital Signal Processing (DSP) algorithm which is required in communication of data over Software Defined Radios (SDR) [14]. These benchmarks, which require higher processing capabilities, are useful in determining the microarchitecture configurations of the high-performance optimized host processor for the two-tiered heterogeneous processor architecture.

TABLE VII: MICROARCHITECTURE CONFIGURATION FOR HIGH-PERFORMANCE OPTIMIZED PROCESSORS FOR IOT

| Parameter Name | Blackscholes | Freqmine | Facesim | Fluidanimate | FFT |
|---|---|---|---|---|---|
| Cores | 8 | 2 | 2 | 2 | 4 |
| Frequency (MHz) | 3200 | 3200 | 3200 | 3200 | 3200 |
| L1-I Cache Size (kB) | 64 | 32 | 8 | 8 | 128 |
| L1-D Cache Size (kB) | 128 | 128 | 64 | 64 | 32 |
| L2 Cache Size (kB) | 256 | 1024 | 1024 | 1024 | 512 |
| L3 Cache Size (kB) | 8192 | 2048 | 8192 | 4096 | 2048 |
| Total Power (W) | 4.549 | 1.565 | 1.546 | 1.546 | 2.563 |
| Execution Time (ms) | 28.1239 | 67.319 | 60.072 | 55.605 | 29.986 |

1) Microarchitecture configurations for low-power optimized processors for IoT: Table VI shows the microarchitecture configurations obtained for the Cholesky and Radix benchmarks from the SPLASH2 benchmark suite. In these configurations, we note that for the low-power optimized processor, the lowest operating frequency and core count are selected. This result can be interpreted intuitively, because a high operating frequency and a high number of cores in the processor increase the power consumption of the processor. We also note that these configurations have large L1-D cache sizes. This is because of the large workload offered by the test benchmarks. This is representative of the growing IoT ecosystem in which large volumes of data are gathered from a large number of sensing elements. The values of total power and execution times for the microarchitecture configurations are also shown in Table VI. We observe that the power values are in the range of a hundred milliwatts and the execution time is in the range of a few hundred milliseconds. These values are within the operational requirements in most IoT deployments. These configurations implement the interface processors in the two-tiered heterogeneous processor architecture. With low-power requirements, these processors can always be operated in active mode, without impacting the power budget of IoT deployments.

2) Microarchitecture configuration for high-performance optimized processors for IoT: Table VII shows the microarchitecture configurations obtained for the Blackscholes, Freqmine, Facesim and Fluidanimate benchmarks from the PARSEC benchmark suite and the FFT benchmark from the SPLASH2 benchmark suite.
We analyze the microarchitecture configurations obtained for these test benchmarks according to the categorization discussed in Subsection VI-B. We observe that for data analysis and data mining applications, represented by the Blackscholes and Freqmine benchmarks, higher performance is achieved primarily by the increase in operating frequency. We note that the size of the L1-D cache for these applications is also high, which is because both are highly data-parallel benchmarks. The size of the L2 cache, for Blackscholes, and L3 cache, for Freqmine, is also high, which is also a result of data-parallelism in these benchmarks. For graphics applications, represented by the Facesim and Fluidanimate benchmarks, higher performance can again be attributed to the increase in operating frequency. These benchmarks are also highly data-parallel, which explains the large L1-D cache, L2 cache and L3 cache in the resulting microarchitecture configurations. In signal processing and communication applications, represented by the FFT benchmark, performance improvement, similar to other applications, is attained by the increase in operating frequency. However, FFT requires a larger instruction cache, as compared to the larger data caches for other applications. The larger L1-I cache could be a result of the FFT benchmark being optimized for low interprocess communication. The total power and execution time of each microarchitecture configuration is also listed in Table VII. These configurations have high total power values in the range of one to a few watts but significantly low execution time values in the range of a few tens of milliseconds. These configurations implement the host processor in the two-tiered heterogeneous processor architecture.
Due to their high-power requirement, these processors are mostly kept in sleep mode and are activated intermittently for short durations to save energy and prolong battery life. Because these processors have shorter execution times, they can execute their tasks quickly and go to sleep, thus decreasing the duration that they are active.

VII. CONCLUSION AND FUTURE WORK

In this paper, we proposed a temporally efficient design space exploration methodology for selecting microarchitecture configurations of processors for IoT. Our exploration methodology consisted of four phases. In the first phase, we determined the best initial settings for tunable processor design parameters using the initial one-shot search method. We also calculated the significance of each design parameter on the overall design in this phase. The results of this phase were used in the second phase to separate the processor design parameters into distinct search sets using an exploration threshold value supplied by the system designer. The third and the fourth phases of the methodology implemented exhaustive and greedy search methods to prune these search sets to determine the best microarchitecture configuration of the processor. We tested our methodology over two design spaces, one for determining low-power optimized and the other for determining high-performance optimized processors for IoT. We validated the results obtained from our methodology by comparing them with solutions obtained from fully exhaustive exploration of the design spaces. Our results revealed that our methodology obtained microarchitecture configurations within 2.23%–3.69% of the configurations obtained from fully exhaustive search. Our methodology only explored 3%–5% of the overall design space to determine these high quality solutions.
This resulted in a 24.16× average speedup in design space exploration as compared to the time required for fully exhaustive exploration.

We also described a two-tiered heterogeneous processor architecture for incorporating power-optimized performance in IoT objects. We used the results obtained from the evaluation of our design space exploration methodology to describe the two-tiered architecture. We categorized the test benchmarks into four different categories, relating them with possible IoT use cases, and analyzed the microarchitecture configurations determined for these benchmarks to make our assertions on processors for IoT objects. We determined that for low-power optimization, microarchitecture configurations with lower core count and lower operating frequency are more suitable. For high-performance optimization, improvement in performance primarily results from an increase in operating frequency. We also analyzed the cache hierarchy for different microarchitecture configurations and related them with the type and size of workloads offered by the test benchmarks.

In the future, we plan to investigate microarchitecture configurations of ultra-low power processors for IoT. We also intend to test our design space exploration methodology using standard IoT benchmarks. We also aim to improve our methodology by incorporating better optimization techniques like genetic and evolutionary algorithms and machine learning. We also plan to study the practical applicability of the two-tiered heterogeneous processor model for processors for IoT objects, and compare the model with processor architecture models currently in use in the IoT market.

REFERENCES

[1] J. Chase, "The evolution of the internet of things - from connected things to living in the data, preparing for challenges and IoT readiness," Texas Instruments, Tech. Rep., Sep 2013.
[2] (2015, Nov) Gartner says 6.4 billion connected "things" will be in use in 2016, up 30 percent from 2015. [Online]. Available: http://www.gartner.com/newsroom/id/3165317
[3] S. Bath. (2016, Aug) Developing solutions for the internet of things. [Online]. Available: https://goo.gl/ECycve
[4] "Developing solutions for the internet of things," Intel, Tech. Rep., 2014.
[5] S. Matalon, R. Klein, and C. Walls, "Embedded system power consumption: A software or hardware issue?" Mentor Graphics, Tech. Rep., Jun 2011.
[6] "Intelligent flexible IoT nodes," ARM, Tech. Rep., Oct 2015.
[7] Y. Veller and S. Matalon, "Why you should optimize power at the electronic system level," Mentor Graphics, Tech. Rep., Aug 2010.
[8] C. Rommel, "Architecting success with heterogenous systems," VDC Research, Mentor Graphics, Tech. Rep., 2016.
[9] "IoT opportunity demands new approach to mcu-based embedded designs - rapidly moving market requires integrated silicon/software platform," Renesas and Synergy, Tech. Rep., Oct 2015.
[10] J. Branke, K. Deb, K. Miettinen, and R. Slowinski, Multiobjective Optimization - Interactive and Evolutionary Approaches. Verlag Berlin Heidelberg: Springer, 2008.
[11] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
[12] K. Char, "Internet of things system design with integrated wireless MCUs," Silicon Labs, ARM, Tech. Rep., Oct 2015.
[13] J. Geuzebroek and A. Vaassen, "Building an efficient, tightly coupled embedded system using an extensible processor," Synopsys, Tech. Rep., Jun 2014.
[14] T. Adegbija, A. Rogacs, C. Patel, and A. Gordon-Ross, "Enabling right-provisioned microprocessor architectures for the internet of things," in ASME Proceedings of International Mechanical Engineering Congress and Exposition, Houston, Texas, USA, Nov 2015.
[15] J. Michanan, R. Dewri, and M. J. Rutherford, "Understanding the power-performance tradeoff through pareto analysis of live performance data," in Proceedings of International Green Computing Conference (IGCC), Dallas, Texas, USA, Nov 2014.
[16] Q. Guo, T. Chen, Y. Chen, Z.-H. Zhou, W. Hu, and Z. Xu, "Effective and efficient microprocessor design space exploration using unlabeled design configurations," ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 1, pp. 20:1–20:18, Jan 2014.
[17] M. Monchiero, R. Canal, and A. Gonzalez, "Power/performance/thermal design-space exploration for multicore architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 5, pp. 666–681, May 2008.
[18] T. Givargis and F. Vahid, "Platune: A tuning framework for system-on-a-chip platforms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 21, no. 11, pp. 1317–1327, Nov 2002.
[19] M. Palesi and T. Givargis, "Multi-objective design space exploration using genetic algorithms," in Proceedings of the 10th International Symposium on Hardware/Software Codesign (CODES), Estes Park, CO, USA, May 2002.
[20] C. Silvano, W. Fornaciari, G. Palermo, V. Zaccaria, F. Castro, M. Martinez, S. Bocchio, R. Zafalon, P. Avasare, G. Vanmeerbeeck, C. Ykman-Couvreur, M. Wouters, C. Kavka, L. Onesti, A. Turco, U. Bondi, G. Mariani, H. Posadas, E. Villar, C. Wu, F. Dongrui, Z. Hao, and T. Shibin, "MULTICUBE: Multi-objective design space exploration of multi-core architectures," in Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Lixouri, Kefalonia, Jul 2010.
[21] A. Munir, A. Gordon-Ross, S. Lysecky, and R. Lysecky, "A lightweight dynamic optimization methodology and application metrics estimation model for wireless sensor networks," Sustainable Computing: Informatics and Systems, vol. 3, no. 2, pp. 94–108, Jun 2013.
[22] D. Sheldon, "Design space exploration of parameterized systems using design of experiments," Ph.D. dissertation, Department of Computer Science, Dec 2011.
[23] E. K. Ardestani and J. Renau, "ESESC: A fast multicore simulator using time-based sampling," in Proceedings of IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, Feb 2013.
[24] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Department of Computer Science, Jan 2011.
[25] Y. Bao, C. Bienia, and K. Li, The PARSEC Benchmark Suite Tutorial - PARSEC 3.0, San Jose, CA, USA, Jun 2011.
[26] (2015) Perl reference. [Online]. Available: http://perlmaven.com/
[27] J. McNamara. (2015, Apr) Excel-writer-XLSX. [Online]. Available: http://search.cpan.org/dist/Excel-Writer-XLSX/
[28] A. F. Lorenzon, M. C. Cera, and A. C. S. Beck, "On the influence of static power consumption in multicore embedded systems," in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), Lisbon, Portugal, May 2015.
[29] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proceedings of 22nd Annual International Symposium on Computer Architecture (ISCA), Santa Margherita Ligure, Italy, Jun 1995.

Prasanna Kansakar is a PhD student in the Department of Computer Science (CS) at Kansas State University (K-State), Manhattan, KS. His research interests include Internet of Things, embedded and cyber-physical systems, computer architecture, multicore, secure and trustworthy systems, and hardware-based security. Kansakar has an MS degree in computer science and engineering from the University of Nevada, Reno (UNR). He is a student member of the IEEE.

Arslan Munir is currently an Assistant Professor in the Department of Computer Science (CS) at Kansas State University (K-State).
He holds a Michelle Munson-Serban Simu Ke ystone Researc h Faculty Scholarshi p from the College of Engineering. He was a postdoctoral research associa te in the E lect rical and Computer Engineering (ECE) departme nt at Rice Unive rsity , Houston, T e xas, USA from May 2012 to J une 2014. He recei ved his M.A.Sc. in E CE from the Uni ve rsity of British Columbia (UBC), V ancouver , Canada, in 2007 and his Ph.D. in ECE from the Uni ve rsity of Florida (UF), Gainesvi lle, Florida, USA, in 2012. From 2007 to 2008, he worked as a software dev elopmen t enginee r at Mentor Graphics in the Embedded Systems Di vision. Munir’ s current researc h interests include embedded and cybe r- physical systems, secure and trustwort hy systems, hardware-b ased security , computer architec ture, multicore, parallel computing , distrib uted computing, reconﬁgura ble computing, artiﬁcial intell igence (AI) safety and security , data analyt ics, and fault toleranc e. Munir recei ved m any acade mic awa rds includi ng the doctoral fellowshi p from Natural Sciences and Engineering Research Council (NSERC) of Canada. He earned gold medals for best performance in elect rical engineeri ng, gold medals and acade mic roll of honor for securin g rank one in pre-engineeri ng prov incial examin ations (out of approxi mately 300,000 candidat es). He is a Senior Member of IEEE.
