您的当前位置：首页正文

Efficient Statistical Performance Modeling for Autonomic, Service-oriented Systems

来源：星星旅游

EfﬁcientStatisticalPerformanceModelingforAutonomic,

Service-orientedSystems

RuiZhang

OxfordUniversityComputingLaboratory

WolfsonBuilding,ParksRoad,OxfordOX13QD,Englandruiz@comlab.ox.ac.uk

AlanBivens

IBMT.J.WatsonResearchCentre

19SkylineDrive,

Hawthorne,NY10532,USA

jbivens@us.ibm.com

IeadRezek

OxfordUniversityEngineeringScienceDepartment,

ParksRoad,OxfordOX13PJ,England

irezek@robots.ox.ac.uk

ABSTRACT

Asservice-orientedenvironmentsgrowinsizeandcomplexity,managingtheirperformancebe-comesincreasinglydifﬁcult.Toassistadministrators,autonomictechniqueshavebeenadoptedtopermittheseenvironmentstobeself-managing(problemlocalization,workloadmanagement,etc.).Thesetechniquesneedasenseofsystemstateandtheabilitytoprojectanewstategivensomechangewithintheenvironment.Recentworkaddressingthisissuefrequentlyusedstatisti-callylearnedmodelswhichlearnedpatternsfromscratch.However,manyenvironmentsalreadyhavemanagementfacilitiesinplacethatcouldprovidepreciseandusefulinsights(e.g.workﬂows)intothesystem.Thispaperintroducesamethodofmodelingservice-orientedsystemperformanceusingBayesiannetworksandspeciﬁcallyaddressesthebeneﬁtsobtainedbyincorporatingtheseinsightsintothemodellearningprocess.Tofurtherminimizemodelbuildingcosts,wedeviseadecentralizedmethodtoconcurrentlylearnpartsofthemodelwhereknowledgeinclusionisim-possibleandmustbeinferredfromdata.Simulationsandapplicationsinactualenvironmentsshowsigniﬁcantreductionsinlearningtime,betteraccuracyandstrongertolerancetosmalllearningdatasets.

KEYWORDS

Responsetimemodeling,Autonomicperformancemanagement,Bayesiannetworks,Gridcom-puting

1Introduction

Service-orientedcomputing[14]facilitatesdistributedapplicationdevelopmentacrossorgani-zationalandgeographicalboundariesthroughthepromotionofself-describingandopensoftwarecomponents.Intheseenvironments,userrequeststraversemultipleheterogeneouscomponentsdrawntogetheron-the-ﬂy,makingitevermorechallengingtomeetend-to-endqualityofservice

(QoS)(e.g.responsetime)goals.Therefore,autonomousmanagement[9]softwareinservice-orientedenvironmentsrequiresamodeltocapturethecomplexsystemdynamicsandlinkindivid-ualcomponentbehaviorstoend-to-endperformance(typicallyresponsetime)states.Thismodelmustaddresstwochallenges:1)itmustprovideaccurateguidancetoautonomicactivitiessuchasresourceprovisioning,loadbalancing,andperformanceproblemlocalizationandremediation,and2)duetothedynamicnatureofautonomicenvironments,themodelmustbereconstructedfromtimetotime,removingpastandobsoleteinformationtoreﬂectmostrecentsystemstates.Thisprocessmustbeaccomplishedatlittlecostwithminimalhumanintervention.

Muchworkhasbeendoneinanalyticalmodelingofcomputersystemperformance.Classicperformancemodelingtheoriessuchasqueueingnetworks,petrinetsandcontroltheoryhavebeenappliedtorealworldsystems[19,6,10].Althoughthesemodelsarestableandmathematicallysound,theymaybeverydifﬁculttoderivemanuallyfromcertaindomains.Themodelingprocesscanrequiresubstantialeffortfromhumanexpertsandbesubjecttoinadvertenthumanmistakes.

Morerecently,somegroupshavecreatedtechniquesthat“train”modelsfromperformancedatacollectedthroughinstrumentation/proﬁling[8,1].Thesestatisticallearningapproachesdonotassumehumaninvolvementandrequirelittledomainanalysis.Whilepromising,suchpracticesarestillatanearlystage,andcanbecomputationallyveryexpensiveandratherdata-sensitive.

Motivatedbythecontrastingnaturesofanalyticalmodelingandstatisticallearningapproaches,thispaperattemptstoharvestthestrengthsofbothbyencodingexistingdomainknowledgeintothestatisticallearningframework.Inthismanner,notonlycanthevulnerabilitiesofanalyticalmodelingbecompensatedbyinformationextractedfromthedata,butthecostanddata-sensitivityofstatisticallearningcanbereducedbyinformativedomainknowledgeguidance.Inparticular,aknowledge-enhancedBayesiannetisproposedtomodeltheend-to-endresponsetimeofservice-orientedsystems.Themodelleveragesreadilyavailabledomainknowledgeanddecentralizedlearningtominimizemodelconstructioncostwhileoptimizingaccuracy.Thatis,byincorporatingdomainknowledgetheBayesiannetworkstructureencodesthecausalrelationshipsbetweenthesystemcomponents.

Thisworkhasleadtothefollowingresearchcontributions:•Usingeasilyattainabledomainknowledge,weestablishtheBayesiannetworkresponsetimemodelwhileeliminatingtheneedforstructurelearningandeasingtheexerciseofparameterlearning.

•Whennoadequatedomainknowledgecanbeutilized,weformulateanovelwaytoconcur-rentlylearnmodelparametersatdistributedlocationsfromlocaldata,thusfurtherdiminish-ingmodelconstructioncosts.

•Anautomatedimplementationoftheapproachisdeliveredtooperateunderaﬂexiblemodel(re)constructionschemesandcanbeintegratedintoautonomicsolutionswithminimaleffort.•Comprehensivesimulationsarepresenteddemonstratingthevastlyreducedmodelconstruc-tiontimeanddata-sensitivity.Tworeal-worldcasestudiesonaservice-orientedGridwerealsopresented.

Theremainderofthepaperisstructuredasfollows.Thesequelprovidesnecessarybackground.Section3presentstheknowledge-enhancedBayesiannetworkforend-to-endresponsetimemod-eling.TheapproachisexperimentallyevaluatedusingsimulationsinSection4.Tworeal-world

applicationsofthemodelaredetailedinSection5.Section6reviewsrelatedwork.Section7concludesanddiscussesfutureresearch.

2Performancemonitoringandmodelingforservice-orientedsystems

TheeDiaMoNDGrid[2],anOGSA-enabledfederateddatabaseofannotatedmammograms,willbeusedasareferencetoservice-orientedsystemsthroughoutthispaper.Figure1showsacommoneDiaMoNDscenario.Initaradiologistretrievessomemammogramsforanalysis.SixGridservicesareinvolved.Havingreceivedarequestfromtheradiologistclient,theimage

listservice,askingforinformationabouttheimagesassignedtothe

radiologist.SupposethatIDsandlocationsarereturnedoftwoimagesthatneedtobecompared.SincethetwoimagesarestoredinalocalhospitalLandaremotehospitalR,respectively,theimagelocatorserviceonbothsites.Thisleadstotheinvocationoflocalandremoteogsa

Management ServerInternetID_Rimage_RPerformance dataRequests/responsesimage_retrieveServiceeamernsU_RIDL, _IDID_LImage_LDB queryImage_LFigure1.PerformancedatareportinginaneDiaMoNDscenario.Monitoringinfrastructurecom-ponentsarenotshown.

Asisdescribedinpreviouswork[20],standardOGSA-basedmiddleware,areinstrumentedwithmonitoringpoints,whichmeasurethetimeelapsedatmiddlewarecomponents.Amonitoringagentresidesoneachmachine,listeningtothemonitoringpointsfordata,possiblybatchingthembeforereportingthemtothemanagementserver.

Thecollecteddataareusedtoperiodicallyconstructend-to-endresponsetimemodelson-the-ﬂy,soastoreﬂectlatestdynamicsintheservice-orientedenvironmentandassistautonomicmanagement.Althoughastrategywheremodelsareupdatedwiththelatestdatamayappearlessextreme,thedisperseofolddataisoftennotpossibleundercurrentstatisticalframeworks

suchasBayesiannetwork[18,7].Theresultisthattheseout-of-dateinformationlingersintheupdatedmodelandadverselyimpactsitsaccuracy,makingaschemepurelybasedonupdatesinadequate.Ahybridschemecombiningfrequentupdatesin-betweenreconstructionsmaybemoreappropriate,butisbeyondthescopeofthispaper,whichfocusesonthemorecomplexandcostlymodelconstructionprocedures.

Ingeneral,themodel(re)constructionshouldoccuratinterval,TCON,usingaslidingdatawindowW,allowingthemodeltousethedataofthecurrentintervalaswellas(K−1)previousintervals(seeEquation1).

W=K∗TCON(1)InEquation1,KisanEnvironmentalCorrelationMetricbasedontheautonomicnatureofthe

environmentandTCONisbepartlybasedontheintervalofdatacollectionasshowninEquation2.

Theenvironmentalcorrelationmetric,K,characterizeshowoftenautonomicactionsmayhap-penintheenvironment,renderingthecurrentenvironmentstatesunrelatedtotheirpasts.Envi-ronmentswhereradicalchanges(e.g.resourceallocation,failurerecoveryactionsetc.)happenmoreoftenwillhavealowerKvaluemakingthemodelconstructiononlyusemorerecentdata.Ifthingschangelessdramatically,alargerKvaluewillallowthemodeltoconsideralargerwindowofdata1.

TCON=αmodel∗TDATA

(2)

whereαmodelistheModelConstructionCoefﬁcientandTDATAistheDataCollectionInterval.Modelsthatcanbebuiltquicklywillhavealowerαmodelvalueindicatingthattheycouldbeappliedinenvironmentsdemandingmorefrequentmodelrecycling.TDATAishowoftenadatapointisreportedanddependentonthemonitoringinfrastructure.K∗αmodelgivesthenumberofdatapointsavailableforinferringthemodel.

3Knowledge-enhancedBayesiannetworkmodel

ThissectionpresentsaBayesiannetworkmodelforresponsetimesinservice-orientedsystems,whichisconstructedusingacombinationofreadilyavailabledomainknowledgeandperformancedatacollectedviainstrumentation(i.e.puttingmonitoringpointsintothesystemasisdescribedintheprevioussection).

3.1Mathematicalframework

ABayesiannetworkisusedtomodeltherelationshipbetweenelapsedtimeonservicesandend-to-endresponsetime,capitalizingonitssupportofautomaticmodelinferencefromdata.Thisformalismisalsochosenbecauseitreﬂectsthestochasticnatureoftheproblemwhilegraphicallyrepresentingthecausalityamongelapsedtimeonservicesandtheireffectsonend-to-endresponsetime.

DEFINITION1.AKnowledgeEnhancedResponseTimeBayesianNetwork(KERT-BN),isadirectedgraph(DAG)representingthejointdistributionP(D,X1,···,Xn)oftherandomvariablesXi,i=1···,n,theelapsedtimeofservicei,andrandomvariableD,theend-to-endresponsetime,suchthat

n󰀁

P(D,X1···Xn)=PD(D|Φ(D))PXi(Xi|Φ(Xi)).(3)

i=1

Here,Φ(Xi)andΦ(D)arethesetofparentsofnodeXi,i=1,...,nandD,respectively,

andwillbedeterminedbyworkﬂowandresourcesharingknowledgeinthenextsubsection.PXi(Xi|Φ(Xi)),i=1,2,...,nandPD(D|Φ(D))aretheConditionalProbabilityDistributions(CPDs)ofXiandD,respectively,anddescribethedependenciesamongserviceelapsedtimesandtheireffectontheend-to-endresponsetime(tobediscussedfurtherinSection3.3).

AKERT-BNcanbeeithercontinuousordiscrete,dependingontheapplicationneeds.SincecontinuousCPDs(typicallyGaussianCPDs)haverelativelyfewparameters,acontinuousKERT-BNallowsthemodeltoconvergemorequicklywhendataislacking.Thispropertyisattractiveinfastchangingenvironmentswheremodelreconstructionmustbeperformedveryfrequentlybeforealotofdatacanbegathered.Iftheenvironmentsarerelativelystable,adiscreteKERT-BNmaybeconsidered.Contrarytoacontinuousmodel,adiscreteoneimposesnoassumptionontheshapeoftheCPDsandislikelytoachievegreateraccuracywhenthereissufﬁcientdata.3.2Establishingthestructurewithworkﬂowandresourcesharing

ABayesiannetcanbebuiltintwosteps.Firstdeterminethestructure(i.e.theDAGwhichcapturesthestatistical/causaldependenciesbetweentherandomvariables)andthenobtainthepa-rameters(CPDs).WhilstautomatedmethodsforBayesiannetstructurelearningexist(i.e.methodsforﬁndingaDAGtopologythatbestﬁtsthedata[13]),theyaretypicallycomputationallyexpen-sive.Forinstance,aKERT-BNforasystemofnserviceswillhaven+1nodesandexponential(inn+1)numberofDAGstructures.SuchcomplexitymakesitintractabletoexhaustivelysearchforthebestDAG[4]inlargeenvironments.EvengreedyalgorithmslikeK2[5]needstoexploreO((n+1)2)possibilities.

Intuitively,dependencyispresentbetweentworandomvariablesifthereissomeconnectionbetweenthetwosuchthatchangeinonevariablecanbesomehowcommunicatedtotheothervariablewhereacorrespondingvariationmaybetriggered.Inthecontextofservice-orientedsystems,itappearsreasonabletoassumethatsuchconnectionsbetweentheperformance(elapsedtimeinthispaper)oftwoservicesiandjcanexistintwoforms:

•Whileanumberofrelationshipsmaybeinterpretedfromworkﬂowinformation,thispaperattemptstoreducethenumberofdependenciesconsideredbyonlyaccountingfordirectandimportantrelationshipsbetweenservicesandtheirimmediateupstreamservices.Ifoneservice(say,i)istheimmediateupstreamserviceoftheotherservice(j)intheworkﬂowgraph,classicperformancemodelingliterature[15]suggestsi’selapsedtimecanbetiedtoitsthroughput.Thuselapsedtimeofjisrelatedtoithroughj’sinputfromi(assumingthatj’sinputcandetermineitselapsedtime).Aburstini’sworkloadoraslowdowninitsserviceratemayaffectitselapsedtime,andisalsoreﬂectedbychangeinj’selapsedtime.•Thetwoservicesaresharingacommonresource(e.g.CPU,memory,network).Statusofthe

commonresourcecanbetiedtotheperformanceofbothservices,resultingindependencybetweenthem.

Boththeworkﬂowandresourcesharinginformationcanbecapturedbyexistingmonitoringin-frastructures[20]orarewelldocumentedatsystemdesignstage.ThisknowledgecanthenusedtodeterminethedependenciesamongKERT-BNelapsedtimenodesX1,...,Xnatlittlecost.Inthecaseofserviceiaffectingdirectlyservicejdownstream,aBayesiannetworkrepresentingtheworkﬂowbetweenthesetwoservicesmustcontainanedgebetweentherandomvariablesXiandXj.Note,thatwhilstserviceiandjmayverywellaffectservicek,say,furtherdownstreamtoj,ourfocushereisonthesimplestDAGrepresentingtheworkﬂow,i.e.theBayesiannetstructureshouldbefreeofloops.Similarly,resourcesharingmayberepresentedbyservicesformingtheparentstoaKERT-BNnodeembodyingtheresourcetheyshare.

Giventhattheelapsedtimeoneachserviceiscollectivelyformingtheend-to-endresponsetime,itisnaturaltoassumethatresponsetimenodeDisdependentuponallelapsedtimenodesX1,...,Xn.Thus,KERT-BNcapturesthisdependencybyaconditionaldependenceofDontheentiresetX,i.e.PD(D|Φ(D))≡PD(D|X).

ThefullDAGstructureoftheKERT-BNfortheeDiaMoNDscenarioinFigure1isdepictedinFigure2,wheretheimagedaiservices)onthetwositesareinvokedinparallelaftertheimagelistservicesarecalled.

X1X2X3X4X5X6VariableServiceX1image_listX2X3X4X5work_listimage_locator_localogsa_dai_localimage_locator_remoteogsa_dai_remoteDX6Figure2.KERT-BNfortheeDiaMoNDscenario.

3.3Determiningparameterswithworkﬂow

HavingidentiﬁedanappropriateDAGfortheKERT-BNandtheparentsetsforeachnode,the

CPDsinEquation3cannowbedetermined,typicallyfromthedata[13].Potentially,parameterlearningcanbequiteexpensivewhennodeshavemany(discrete)parents(e.g.DinFigure2)or(discrete)parentshavehighcardinality2.

Specialmodels(e.g.NaiveBayesiannetworkandTAN)havebeenproposedtoreducethecomplexityofparameterlearning,byfocusingonlyonimportantparent-childrendependenciesthatmainlyaffectmodelaccuracy.Onegoalofthisworkistominimizethecostofparameterlearningwithoutthisaccuracycompromisebymakinguseofdomainknowledge.SpeciﬁctoaKERT-BN,theneedforlearningPD(D|X1,...,Xn)canbelargelyeliminatedbyexploitingpreciseworkﬂowinformationinthefollowingmanner:

PD(D=f(X)|X)=1−lPD(D=f(X)|X)=l

(4)

wherefisadeterministicfunction(seetheexampleinthenextparagraph)thatlinkselapsedtimestoresponsetimeusingworkﬂowinformation3;listheprobabilityofa“leak”situation[17]whereresponsetimeDisnotdeterministicallygivenbytheworkﬂow-derivedfunctionf(X)duetonoises(e.g.errorsmaybeintroducedintothepredictionofDusingf(X),becausetheplacementofmonitoringpointsthroughcodeinstrumentationmayberestrictedandsubsequentmeasurementstakenonXarenotprecise).

ForFigure2,thedeterministicfunctiongivingtheCPDis:D=X1+X2+max(X3+X5,X4+X6).Themaximumoperationistheresultofparallelinvocationoftheimage

daiservices)onthetwosites,whereasthesumoperationsisaproductof

sequentialserviceinvocations.

3.4Decentralizedparameterlearning

TheKERT-BNCPDs,PXi(Xi|Φ(Xi)),i=1,2,...,ncanbeacquiredbycollectingthenum-berofdatainstancesXi=xi∧Φ(Xi)=Φ(xi),wherexiandΦ(xi)areinstancesofelapsedtimesofXiandΦ(Xi),andusinganymaximumlikelihood(i.e.normalizedcounts)orBayesianmethod[13].

NoticethatthecomputationonlyrequiresdataaboutXiandallofitsKERT-BNparentsΦ(Xi).ThisdatalocalitysuggeststhatthecomputationofeachPXi(Xi|Φ(Xi))canactuallybeperformedonservicei,NTothisend,communicationcanbesetupbetweenthemonitoringagent(seeSubsection2)forserviceiandthoseagentsforservicescorrespondingtoΦ(Xi).ForserviceswhereΦ(Xi)=∅,nocommunicationisneeded.Suchcommunicationtakesplaceperiodicallyatafrequencythatwillnotﬂoodthenetwork.Thecommunicateddataarebatchedtogetherwithlocallycollecteddataoneachservicei(serviceswhereΦ(Xi)=∅batchlocaldataonly).TheparametercomputationisinitiatedandtheresultsreportedeveryTCON.

ItcanbenotedfromFigure1thatuserrequestsaresentfromimmediateupstreamservices

local)toadownstreamservice(e.g.ogsalocal).These(e.g.image

communicationscanbeleveragedtosendelapsedtimedatafromparentsΦ(Xi)toXi,byattachingthedatainanextraSOAPsegmentattheendoftheapplicationrequestmessages.ThecomputationofPXi(Xi|Φ(Xi))canthenbeconductedonserviceiwhenneeded.Thisideawillbepursuedfurtherinthefuture.

Decentralizingtheparameterlearningcomputationshasseveralattractions:•The(potentiallylarge)computationalexpensedoesnotfallonacentralmanagementnodewhichmaybecomeabottleneckasthesystemscales.EventhoughtheentireKERT-BNstructureisstillmaintainedinthecentralserver,itwillprovefarmorelightweightthanstoringandcomputingallCPDsandshouldnotcompromisescalabilitytoalargeextent.•Thecomputationcanbeperformedconcurrentlyreducingoveralltimetaken.

•Theelapsedtimedatamaybeembeddedinapplicationrequestsminimizingcommunicationoverhead.

4Evaluationthroughsimulations

Thissectionevaluatesourapproachthroughsimulationsunderacomprehensivesetofsettingsthataredifﬁculttoreplicateinreal-worldsystems.First,theperformanceofKERT-BNiscom-paredagainstNaiveResponseTimeBayesianNetwork(NRT-BN)(i.e.learnedpurelyfromdataviabothstructurelearningwithK2[5]andparameterlearning)inadecentralizedprocess.Thentheextrabeneﬁtoflearningunknown(fromdomainknowledge)KERT-BNparametersinadecentral-izedmannerisassessed.

4.1Evaluationsettingsandmetrics

Experimentsinthissectionwereconductedwithinaservice-orientedsystemsimulatedinMatlab[11].Thesimulatedservicesreceiveandsendcallsamongotherandrandomlygenerateaprocessingdelayuponreceivingcalls.Theyareassembledtogetherbydifferentworkﬂowstoconstitutesimulatedapplications.Thesimulateddelays(andresponsetimes)areusedtoformtrainingandtestingdatasets.ContinuousKERT-BNandNRT-BNmodels(wherelinEquation4isassumedtobe0)withGaussianCPDswereconstructedusingtrainingdataandthenrunagainstthetestingdata,soastogaugetheirperformanceintermsofthefollowingtwometrics:

•Constructiontime-thetimeittakestobuildtheentireBayesiannetwork(i.e.includingthestructureandparametervalues)

•Data-ﬁttingaccuracy-theloglikelihoodoftestingdatagiventheBayesiannetworkmodel,log10p(Data|BN)[13].Thehigherthelikelihood,thebetterthemodelﬁts(i.e.accuratelyrepresents)thetestingdata.

TheexperimentswereimplementedusingtheMatlabBayesiannetworktool-box[12]andcon-ductedonaRedhatLinuxmachinewithtwo3.0GHZdualcoreCPUsand1GBmemory.Inallexperimentruns,theenvironmentalmetric(seeSection2)wassettoK=3toemulatemodeltraininginenvironmentswheredatamorethan2constructionintervals(seeSection2)oldweredeemeduncorrelatedwithpresentenvironmentstatus.ThedatacollectionintervalhadthevalueofTDATA=10seconds,ahighfrequencyatwhichthereporteddatawillnotﬂoodthenetworkinaheavilyloadedenvironment.Alldata-ﬁttingaccuracynumbersaremeasuredagainstatrainingsetof100datapoints4.2KERT-BNvs.NRT-BN

Intheﬁrstexperiment,boththeconstructiontimeandthedata-ﬁttingaccuracyofKERT-BNandNRT-BNbuiltfor30simulatedservicesarecompared.Thetrainingsetscontainedfrom36datapoints(i.e.K∗αModel=3∗12=36andmodelconstructionintervalTCON=αModel∗TDATA=2minutes)to1080datapoints(withK∗αModel=3∗360=1080andmodelconstructionintervalTCON=60minutes).Foreachtrainingsetsizeconsidered,theexperimentwasrepeated10times(eachwithfreshtrainingandtestingdata)toobtainaveragemodelperformance.

TheresultofthisexperimentisillustratedinFigure3.Asonewouldexpect,theconstructiontimeofbothNRT-BNandKERT-BNgrowslinearlywiththetrainingsetsize.KERT-BNcon-sistentlyoutperformsNRT-BN,andtheadvantagebecomesincreasinglypatent,asthetrainingset

grows.ThisobservationcanbeexplainedbythefactthatKERT-BNonlyrequirespartialparame-terlearningwiththetrainingdata,whereasNRT-BNmustgothroughbothanexpensivestructurallearningphaseandafullparameterlearningphase.

Despiteskippingstructurallearningcompletelyandparameterlearningpartly,KERT-BNstillachievesaslightlysuperioraccuracyasisillustratedintherighthalfofFigure3.Inaddition,theNRT-BNappearstobequitesensitivetotrainingdatasizeandrequirearelativelylargenumberofdatapoints(about600inFigure3withαModel=180andmodelconstructionintervalTCON=1800seconds=30minutes)toreachastableaccuracylevel.Toavoiddataaccuracyloss,shortmodelconstructionintervalsmightnotbeusedwithNRT-BNwhennecessary,ifdatacollectionisnotsufﬁcientlyfrequent.Incontrary,KERT-BNisnotliabletosuchrisks,asitsaccuracyconvergesmuchmorequickly.

200180KERT−BNNRT−BN−5400−5300Construction time (x seconds)1601401201008060402000200Data−fitting accuracy−5500−5600−5700−5800−5900−6000KERT−BNNRT−BN020040060080010001200Training data set size40060080010001200Training data set sizeFigure3.KERT-BNperformancevs.NRT-BNperformancewithdifferenttrainingsetsizes.

Inthesecondexperiment,theperformanceofKERT-BNandNRT-BNwhenappliedtoupto100simulatedservices(embodyingsmalltolargedatacenters)isplotted.Themodelsaretrainedwithtrainingsetsofsize36datapoints(againαModel=12andTCON=2minutes)tostudyhowpracticalitistolearnthemodelsinfastchangingenvironmentsofdifferentsizes.Thisexperimentwasalsorepeated10times.

Figure4illustratestheoutcomeofthesecondexperiment.Anundesirednon-linearcorrelation(discussedinSubsection3.2)betweenNRT-BNconstructiontimeandthenumberofnodesinthemodelcanclearlybeseeninthelefthalfoftheﬁgure.Thiscorrelationdedeteriorateevenfasterinbiggerenvironmentsnotillustratedintheﬁgure,takingover2hoursfor200services,over10hoursfor300,andmorethan2daysfor500services.Consequently,astheenvironmentgrowsinsize,NRT-BNmaysimplybeimpossibletobuildatshortmodelconstructionintervals,itslearningtimeeclipsinghowoftenitissupposedtoberenewed.Inthisexperimentforexample,NRT-BNbeingconstructedatthedesignatedintervalofTCON=2minuteswillnotbefeasibleforanyenvironmentwithmorethan60services.Incontrast,theconstructiontimeofKERT-BNremainsﬂatacrossdifferentnumberofnodes,asonlyrelativelycheapparameterlearningamongservicenodesisrequired.TherighthalfoftheﬁgureoncemoreconﬁrmsthatKERT-BNachievesgreateraccuracythanNRT-BNundervarioussettings(numberofservicesinthisexperiment).

700KERT−BN600NRT−BN0−0.2−0.4x 104

KERT−BNNRT−BNConstruction time (x seconds)Data−fitting accuracy020406080100500−0.6−0.8−1−1.2−1.4−1.6400300200100−1.8−20204060801000Number of BN nodesNumber of BN nodes

Figure4.KERT-BNperformancevs.NRT-BNperformancewithdifferentenvironmentsizes(i.e.servicenumbers).

4.3Decentralizedparameterlearningvs.centralizedparameterlearning

HavingshownthatKERT-BNoutperformsNRT-BNinvariouscentralizedlearningsituations,thisadvantageisfurtherextendedbylearningtheunknownKERT-BNparametersinadecentral-izedmanner,asisproposedinSubsection3.4.Tothisend,thetimetakentolearneachCPDinKERT-BNsismeasured.SincetheseCPDswillbecomputedinparallelonmonitoringagentsinpractice,thedecentralizedlearningtimeisthemaximumofindividuallearningtimesacrossallCPDs,andiscomparedagainstthetimeforregularcentralizedKERT-BNparameterlearninginFigure5.TheaccuracyofthesetwoKERT-BNparameterlearningmethodsisnotplottedonthegroundsthattheyproduceprincipallythesameparameters.

ForeachKERT-BNsize,theparametersof20randomlygeneratedKERT-BNsarelearned,withtheaveragedecentralizedlearningtimeandaveragecentralizedlearningtimeplottedintheFigure.Itcanbeseenthatthedecentralizedlearningtimeisconstantlysuperiortothecentralizedlearningtime.Theﬁgurealsorevealsanobvioustrendofthissuperioritybecomingmoreandmoreconsiderableasthenumberofservices(thusthenumberofKERT-BNCPDs)increase.

5Applications

ThissectiondemonstratestheuseofKERT-BNintacklingreal-worldperformancemanage-mentproblemsontheeDiaMoNDtest-bed,andfurtherjustiﬁestheapproachpresentedinthispa-local,imagelocal,per.TheeDiaMoNDservicesshowninFigure1,ogsa

ogsaremote,andimageremotewerehostedbyfourAIXmachineswithtwo3.0GHZdualcoreCPUsand2GBmemory.Theimagelistserviceswererunontwoseparate3.0GHZdualcoreCPUsofaRedhatLinuxserverwith1GBmemory.Sincetheentiretest-bediswithinthesamesub-net,extraroutingwasimposedbetweenimage

locator

4.5KERT−BN parameter learning time (x seconds)Decentralized learning43.532.521.510.50Centralized learning0510KERT−BN sizes

1520253035Figure5.Decentralizedlearningtimevs.centralizedlearningtimeforvariousenvironmentsizes(i.e.servicenumbers).

GiventhateDiaMoNDfostersarelativelydedicatedandstablesystemwhereTDATA=20seconds,abiggercorrelationmetricK=10andalongerconstructionintervalTCON=20minutes(αModel=120)werechosen.Discretemodelsarebuiltfortworeasons:1)therearecompara-tivelymanydatapointsandtheymaynotﬁtGaussiandistributionsinpractice;2)MATLABBNTdoesnotsupportnon-lineardeterministicCPDsthatcontainmaximumrelationships.5.1dComp:Compensatingformissingdata

InlargedistributedsystemsliketheGrid,performancedatafromsomesystemcomponentsmaygomissingdueto1)lackofinstrumentation,2)failureintheactofdatareportingor3)theneedtoreducemonitoringoverhead.Tocopewiththeconsequences,mechanismsmustbeprovidedtosupportestimationofperformancestatesofunobservablecomponentsusingdatafromtheobservableones.

Typically,theonlyknowledgeavailableaboutelapsedtimeonunobservablecomponents(whoselatestperformancedatagomissing)isfromhistoricalmeasurementsorcomponentproviders.Suchpriorknowledgeislikelytobeobsolete,impreciseornotavailable.dComp,atechniquethatap-pliesKERT-BNtoupdatethispriorknowledgewithcurrentelapsedtimemeasurementsfromob-servablecomponentsaswellasresponsetimemeasurements,isproposedtoaddressthisproblem.dCompcomputesaPosterior(Probability)Distributionp(Y|O=E(o))usingstandardBN-basedinference[13]basedonEquation3foreachYwherenoelapsedtimedatawasavailable(OistheobservablesetofservicesandE(o)isthecurrentmeasurementmean).Intheabsenceofusingfullblownﬁll-inmethods(likeExpectationMaximization),itsufﬁcestojustusethesummaryofobservationstatistics(themean,E(o))toassessY.

ToillustratedComp,theKERT-BNinFigure2isconsidered.Noserviceisassumedtobeobservable.dCompisemployedtoinfertheposteriordistributionforX4usingobservationsontherestofthevariables.Figure6(a)comparestheinferredposteriordistributionofX4withthepriordistributionofX4,anddemonstrateshowtheposteriordistributionhasupdatedourknowledge

aboutX4.Inthiscase,theposteriordistributionhasshifted(fromthepriordistribution)towardtheactualelapsedtimeandbecomemoredeterministicwithanarrowershape.

10.90.80.7Prior distributionPosterior distributionObservation10.90.80.7Prior distributionPosterior distributionObservationProbability0.60.50.40.30.20.100102030405060Probability0.60.50.40.30.20.100102030405060708090Elapsed time (x0.01 second)Response time (x0.01 second)(a)dCompexample:posteriordistributionofX4vs.(b)pAccelexample:responsetimehadX4per-priordistributionofX4.formednormallyvs.observedresponsetime.

Figure6.ExamplesofdCompandpAccel.

5.2pAccel:Acceleratingkeyserviceperformance

Duetothecomplexityofservice-orientedenvironments,asigniﬁcantperformanceboostforaparticularservicemaynotleadtosystem-widebeneﬁts.Forinstance,ifserviceAisbeinginvokedinparallelwithanotherserviceBthathasasigniﬁcantlylongerelapsedtime,reducingA’selapsedtimecandolittletoimprovetheoverallperformance.Amethodtoassesstheend-to-endimpactoflocalactionsisessentialtoavoidingspendingalotofeffortinacceleratingthoseservicesthatwillbringlittleresponsetimeimprovement.

pAccelisameanstoachievethisgoalthroughtheuseofKERT-BN.BasedonEquation3,pAccelcomputestheposteriorresponsetimedistributionp(D|Z=E(z))giventhemeanofapredictionabout,Ztheelapsedtimeofanyservice.Theobservationmeanissufﬁcientforthesamereasonaswasarguedintheprevioussubsection.

Recalltheexampleintheprevioussubsection.AposteriorresponsetimedistributioncannowbecomputedwithX4reducedto90%(e.g.afterlocalresourceallocationactions).Figure6(b)showsthattheposteriorresponsetimeprovidesagoodapproximationoftheactualimprovedresponsetimemean.Thedifferencebetweentheposteriordistributionandthepriordistributioncanbeusedtogaugethebeneﬁtofresourceactionsandguideautonomicdecisionmaking.5.3Justiﬁcation

TheefﬁcacyofKERT-BNduringmodelre-constructionsintheeDiaMoNDscenarioinFigure1isexaminedbycontrastingtheresponsetimedistributionprojectedusingpAccelwithKERT-BNandNRT-BNagainstrealresponsetimemeasurements.

Consideringthatbothhumanusersandautonomicsoftwarearelikelytobeinterestedinas-sessmentslike“Whatistheprobabilitythatresponsetimewillexceedthethreshold(s)?”,wedraw

thecomparisonusingRelativeThresholdViolationProbabilityErrordeﬁnedasfollows:

󰀡=

|Pbn(D>h)−Preal(D>h)|

fromsomeblack-boxcomponents.2)modelconstructioncanbepronetohumanerrors;3)modelsbuiltareaprioriandmaynotadaptwelltoworkloadorsystemchanges;4)assumptionsthatdeviatesigniﬁcantlyfromrealsystemconditions(e.g.thesystembeinginstablestate[19])mayhavetobemade.

Thesecondschoolofworkscontainemergingresearchrootedinstatisticallearningtechniquesusingperformancedatacollectedthroughinstrumentation.Forexample,tree-augmentednaiveBayesiannetworksarelearnedfromdatatocorrelatesystemcomponentresourcestate(CPUusage,memoryusageetc.)withservicelevelagreementviolation/compliancestates[8].Unlikethoseintheﬁrstschool,statisticallylearnedmodelsmostlyassumeverylittleornodomainknowledgeandcanbeprogrammaticallyupdated/reconstructedinresponsetosystemchanges.Notonlyarethesemodelsoftenlearnedentirelyfromscratchatpotentiallysubstantialcomputationalcost,buttheycanberathersensitivetonoisyormissingdata.

Theapproachdescribedinthispaperovercomestheseweaknesses,byincorporatingdomainknowledgeinthestatisticallearningframeworkanddecentralizingthelearningprocedure.Rishet.al[16]hasalsoadoptedaknowledge-enhancedstrategytoconstructafaultdeterminationmodel.However,applyingRish’sstrategyherewouldbedifﬁcultgivenitsdependencyonbinarystatevariables,binarydomainrelationshipsandstrongindependenceassumptions.

7ConclusionsandFutureWork

Thispaperpresentsanautomatedandcost-effectivesolutiontoend-to-endperformancemod-elingforservice-orientedsystems,anessentialissuetowardsprovidingautonomiccapabilitiesintheseenvironments.Weleverageacombinationofeasilyattainabledomainknowledge(mainlytheworkﬂowinthispaper)andperformancedatacollectedsystem-wide,toinduceaknowledge-enhancedresponsetimeBayesiannetwork(KERT-BN)thataccuratelycorrelatessystemcompo-nentperformancetoend-to-endQoSgoals.Formodelelementsthatcannotbedeterminedinthisway,adecentralizedlearningmechanismisputinplacetoreducelearningoverheadincurred.Experimentsinbothsimulationandreal-worldcontextsshowthattheutilizationofdomainknowl-edge1)largelyreducesthemodelbuildingcostwithoutsacriﬁcingmodelaccuracy;and2)rapidlyguidesthemodeltrainingprocesstowardsanaccurateproductwithfewdata.Thesefeaturesareparticularlyvaluableinhighlydynamicandcomplexenvironmentswherethecurrentmodelquicklyexpiresandmustberebuilt.

Theapproachisautomatedthroughemployinganinstrumentation-basedtechniquetotheau-tomaticextractionofworkﬂowandresourcesharinginformation[20],programmaticallyfeedingthisknowledgetogetherwithperformancedataintotheBayesiannetworkframework,andsub-sequentlyinducingthemodelunderaperiodicalscheme.Theautomatedmodelconstructioncanservetofurtherminimizetheneedforhumanengagementsinmodel-drivenproceduresandsteerthemclosertocompleteautonomicsolutions.

TheKERT-BNapproachcanbeeffortlesslygeneralizedandappliedtoanyinstrumenteddis-tributedsystemfeaturingusertransactionsthattraversesystemcomponents.Withminoradjust-ments,itmayalsobeusedtomodeltherelationshipbetweencomponent-levelmetricsotherthanelapsedtime(e.g.CPUormemoryusage)andend-to-endperformancegoals.Theseissuesarepromisingtopicsoffuturework.

Anotherimportantextensionofourworkisemployingdomainknowledgeanddecentraliza-tiontechniquestoreducethecostofprobabilityassessmentusingaconstructedmodel.Crucial

autonomicroutinessuchasresourceprovisioningandproblemlocalizationwillproﬁtgreatlyonrapidresponsetimeassessment.Addingthisvaluetotheworkpresentedinthispaperwillleadtoaccurateperformancemodelsthatareinexpensivetobuildanduse,andgoalongwaytofacilitateefﬁcient,scalableautonomicsystemmanagement.

References

[1]P.Barham,A.Donnelly,R.Isaacs,andR.Mortier.Usingmagpieforrequestextractionandworkload

modelling.InProceedingsofthe6thSymposiumonOperatingSystemsDesignandImplementation(OSDI’04),,December2004.

[2]J.M.Brady,D.J.Gavaghan,A.C.Simpson,M.Mulet-Parada,andR.P.Highnam.eDiaMoND:A

grid-enabledfederateddatabaseofannotatedmammograms.InF.Berman,G.C.Fox,andA.J.G.Hey,editors,GridComputing:MakingtheGlobalInfrastructureaReality,pages923–943.WileySeries,2003.

[3]J.Cardoso,J.Miller,A.Sheth,andJ.Arnold.Modelingqualityofserviceforworkﬂowsandweb

serviceprocesses.Technicalreport,LSDISLab,ComputerScience,UniversityofGeorgia,May2002.[4]G.F.C.E.ClarkGlymour(Editor).Computation,CausationandDiscovery.MITPress,1999.

[5]G.CooperandE.Herskovits.Abayesianmethodfortheinductionofprobabilisticnetworksfromdata.

MachineLearning,pages309–347,1992.

[6]N.J.Dingle,P.G.Harrison,andW.J.Knottenbelt.Responsetimedensitiesingeneralisedstochastic

petrinetmodels.InWOSP’02:Proceedingsofthe3rdinternationalworkshoponSoftwareandperformance,pages46–54,NewYork,NY,USA,2002.ACMPress.

[7]N.FriedmanandM.Goldszmidt.SequentialupdateofBayesiannetworkstructure.In13thConf.on

UncertaintyinArtiﬁcialIntelligence,pages165–174,1997.

[8]C.Ira,G.Moises,K.Terence,S.Julie,andC.Jeffrey.Correlatinginstrumentationdatatosystem

states:Abuildingblockforautomateddiagnosisandcontrol.Technicalreport,HPLabs,November2004.

[9]J.O.KephartandD.M.Chess.Thevisionofautonomiccomputing.Computer,36(1):41–50,2003.[10]Y.Lu,T.Abdelzaher,C.Lu,L.Sha,andX.Liu.Feedbackcontrolwithqueueing-theoreticprediction

forrelativedelayguaranteesinwebservers.InRTAS’03:ProceedingsoftheThe9thIEEEReal-TimeandEmbeddedTechnologyandApplicationsSymposium,page208,Washington,DC,USA,2003.IEEEComputerSociety.

[11]C.Moler.NumericalComputingwithMATLAB.SocietyforIndustrialandAppliedMathematics,

2004.

[12]K.Murphy.Bayesnettoolboxformatlab.ComputerScienceandStatistics,2001.[13]R.Neapolitan.ProbabilisticReasoninginExpertSystems.WileyInterscience,1989.

[14]M.P.PapazoglouandD.Georgakopoulos.Introduction.Commun.ACM,46(10):24–28,2003.

[15]N.M.P.PeterG.Harrison.Performancemodellingofcommunicationnetworksandcomputerarchi-tectures.Wokingham,Addison-Wesley,1992.

[16]I.Rish,M.Brodie,andS.Ma.Accuracyvs.efﬁciencytrade-offsinprobabilisticdiagnosis.InEigh-teenthnationalconferenceonArtiﬁcialintelligence,pages560–566.AmericanAssociationforArtiﬁ-cialIntelligence,2002.

[17]S.RussellandP.Norvig.ArtiﬁcialIntelligence:AMordenApproach.PrenticeHall,2003.

[18]D.J.SpiegelhalterandS.L.Lauritzen.Sequentialupdatingofconditionalprobabilitiesondirected

graphicalstructures.Networks,pages579–605,1990.

[19]B.Urgaonkar,G.Paciﬁci,P.Shenoy,M.Spreitzer,andA.Tantawi.Ananalyticalmodelformulti-tier

internetservicesanditsapplications.SIGMETRICSPerform.Eval.Rev.,33(1):291–302,2005.

[20]R.Zhang,S.Heisig,S.Moyle,andS.McKeever.Ogsa-basedgridworkloadmonitoring.InPro-ceedingsofthe5thIEEEInternationalSymposiumonClusterandGridComputing(CCGrid05).IEEEComputerSocietyPress,April2006.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文