Service-orientedSystems
RuiZhang
OxfordUniversityComputingLaboratory
WolfsonBuilding,ParksRoad,OxfordOX13QD,Englandruiz@comlab.ox.ac.uk
AlanBivens
IBMT.J.WatsonResearchCentre
19SkylineDrive,
Hawthorne,NY10532,USA
jbivens@us.ibm.com
IeadRezek
OxfordUniversityEngineeringScienceDepartment,
ParksRoad,OxfordOX13PJ,England
irezek@robots.ox.ac.uk
ABSTRACT
Asservice-orientedenvironmentsgrowinsizeandcomplexity,managingtheirperformancebe-comesincreasinglydifficult.Toassistadministrators,autonomictechniqueshavebeenadoptedtopermittheseenvironmentstobeself-managing(problemlocalization,workloadmanagement,etc.).Thesetechniquesneedasenseofsystemstateandtheabilitytoprojectanewstategivensomechangewithintheenvironment.Recentworkaddressingthisissuefrequentlyusedstatisti-callylearnedmodelswhichlearnedpatternsfromscratch.However,manyenvironmentsalreadyhavemanagementfacilitiesinplacethatcouldprovidepreciseandusefulinsights(e.g.workflows)intothesystem.Thispaperintroducesamethodofmodelingservice-orientedsystemperformanceusingBayesiannetworksandspecificallyaddressesthebenefitsobtainedbyincorporatingtheseinsightsintothemodellearningprocess.Tofurtherminimizemodelbuildingcosts,wedeviseadecentralizedmethodtoconcurrentlylearnpartsofthemodelwhereknowledgeinclusionisim-possibleandmustbeinferredfromdata.Simulationsandapplicationsinactualenvironmentsshowsignificantreductionsinlearningtime,betteraccuracyandstrongertolerancetosmalllearningdatasets.
KEYWORDS
Responsetimemodeling,Autonomicperformancemanagement,Bayesiannetworks,Gridcom-puting
1Introduction
Service-orientedcomputing[14]facilitatesdistributedapplicationdevelopmentacrossorgani-zationalandgeographicalboundariesthroughthepromotionofself-describingandopensoftwarecomponents.Intheseenvironments,userrequeststraversemultipleheterogeneouscomponentsdrawntogetheron-the-fly,makingitevermorechallengingtomeetend-to-endqualityofservice
(QoS)(e.g.responsetime)goals.Therefore,autonomousmanagement[9]softwareinservice-orientedenvironmentsrequiresamodeltocapturethecomplexsystemdynamicsandlinkindivid-ualcomponentbehaviorstoend-to-endperformance(typicallyresponsetime)states.Thismodelmustaddresstwochallenges:1)itmustprovideaccurateguidancetoautonomicactivitiessuchasresourceprovisioning,loadbalancing,andperformanceproblemlocalizationandremediation,and2)duetothedynamicnatureofautonomicenvironments,themodelmustbereconstructedfromtimetotime,removingpastandobsoleteinformationtoreflectmostrecentsystemstates.Thisprocessmustbeaccomplishedatlittlecostwithminimalhumanintervention.
Muchworkhasbeendoneinanalyticalmodelingofcomputersystemperformance.Classicperformancemodelingtheoriessuchasqueueingnetworks,petrinetsandcontroltheoryhavebeenappliedtorealworldsystems[19,6,10].Althoughthesemodelsarestableandmathematicallysound,theymaybeverydifficulttoderivemanuallyfromcertaindomains.Themodelingprocesscanrequiresubstantialeffortfromhumanexpertsandbesubjecttoinadvertenthumanmistakes.
Morerecently,somegroupshavecreatedtechniquesthat“train”modelsfromperformancedatacollectedthroughinstrumentation/profiling[8,1].Thesestatisticallearningapproachesdonotassumehumaninvolvementandrequirelittledomainanalysis.Whilepromising,suchpracticesarestillatanearlystage,andcanbecomputationallyveryexpensiveandratherdata-sensitive.
Motivatedbythecontrastingnaturesofanalyticalmodelingandstatisticallearningapproaches,thispaperattemptstoharvestthestrengthsofbothbyencodingexistingdomainknowledgeintothestatisticallearningframework.Inthismanner,notonlycanthevulnerabilitiesofanalyticalmodelingbecompensatedbyinformationextractedfromthedata,butthecostanddata-sensitivityofstatisticallearningcanbereducedbyinformativedomainknowledgeguidance.Inparticular,aknowledge-enhancedBayesiannetisproposedtomodeltheend-to-endresponsetimeofservice-orientedsystems.Themodelleveragesreadilyavailabledomainknowledgeanddecentralizedlearningtominimizemodelconstructioncostwhileoptimizingaccuracy.Thatis,byincorporatingdomainknowledgetheBayesiannetworkstructureencodesthecausalrelationshipsbetweenthesystemcomponents.
Thisworkhasleadtothefollowingresearchcontributions:•Usingeasilyattainabledomainknowledge,weestablishtheBayesiannetworkresponsetimemodelwhileeliminatingtheneedforstructurelearningandeasingtheexerciseofparameterlearning.
•Whennoadequatedomainknowledgecanbeutilized,weformulateanovelwaytoconcur-rentlylearnmodelparametersatdistributedlocationsfromlocaldata,thusfurtherdiminish-ingmodelconstructioncosts.
•Anautomatedimplementationoftheapproachisdeliveredtooperateunderaflexiblemodel(re)constructionschemesandcanbeintegratedintoautonomicsolutionswithminimaleffort.•Comprehensivesimulationsarepresenteddemonstratingthevastlyreducedmodelconstruc-tiontimeanddata-sensitivity.Tworeal-worldcasestudiesonaservice-orientedGridwerealsopresented.
Theremainderofthepaperisstructuredasfollows.Thesequelprovidesnecessarybackground.Section3presentstheknowledge-enhancedBayesiannetworkforend-to-endresponsetimemod-eling.TheapproachisexperimentallyevaluatedusingsimulationsinSection4.Tworeal-world
applicationsofthemodelaredetailedinSection5.Section6reviewsrelatedwork.Section7concludesanddiscussesfutureresearch.
2Performancemonitoringandmodelingforservice-orientedsystems
TheeDiaMoNDGrid[2],anOGSA-enabledfederateddatabaseofannotatedmammograms,willbeusedasareferencetoservice-orientedsystemsthroughoutthispaper.Figure1showsacommoneDiaMoNDscenario.Initaradiologistretrievessomemammogramsforanalysis.SixGridservicesareinvolved.Havingreceivedarequestfromtheradiologistclient,theimage
listservice,askingforinformationabouttheimagesassignedtothe
radiologist.SupposethatIDsandlocationsarereturnedoftwoimagesthatneedtobecompared.SincethetwoimagesarestoredinalocalhospitalLandaremotehospitalR,respectively,theimagelocatorserviceonbothsites.Thisleadstotheinvocationoflocalandremoteogsa
Management ServerInternetID_Rimage_RPerformance dataRequests/responsesimage_retrieveServiceeamernsU_RIDL, _IDID_LImage_LDB queryImage_LFigure1.PerformancedatareportinginaneDiaMoNDscenario.Monitoringinfrastructurecom-ponentsarenotshown.
Asisdescribedinpreviouswork[20],standardOGSA-basedmiddleware,areinstrumentedwithmonitoringpoints,whichmeasurethetimeelapsedatmiddlewarecomponents.Amonitoringagentresidesoneachmachine,listeningtothemonitoringpointsfordata,possiblybatchingthembeforereportingthemtothemanagementserver.
Thecollecteddataareusedtoperiodicallyconstructend-to-endresponsetimemodelson-the-fly,soastoreflectlatestdynamicsintheservice-orientedenvironmentandassistautonomicmanagement.Althoughastrategywheremodelsareupdatedwiththelatestdatamayappearlessextreme,thedisperseofolddataisoftennotpossibleundercurrentstatisticalframeworks
suchasBayesiannetwork[18,7].Theresultisthattheseout-of-dateinformationlingersintheupdatedmodelandadverselyimpactsitsaccuracy,makingaschemepurelybasedonupdatesinadequate.Ahybridschemecombiningfrequentupdatesin-betweenreconstructionsmaybemoreappropriate,butisbeyondthescopeofthispaper,whichfocusesonthemorecomplexandcostlymodelconstructionprocedures.
Ingeneral,themodel(re)constructionshouldoccuratinterval,TCON,usingaslidingdatawindowW,allowingthemodeltousethedataofthecurrentintervalaswellas(K−1)previousintervals(seeEquation1).
W=K∗TCON(1)InEquation1,KisanEnvironmentalCorrelationMetricbasedontheautonomicnatureofthe
environmentandTCONisbepartlybasedontheintervalofdatacollectionasshowninEquation2.
Theenvironmentalcorrelationmetric,K,characterizeshowoftenautonomicactionsmayhap-penintheenvironment,renderingthecurrentenvironmentstatesunrelatedtotheirpasts.Envi-ronmentswhereradicalchanges(e.g.resourceallocation,failurerecoveryactionsetc.)happenmoreoftenwillhavealowerKvaluemakingthemodelconstructiononlyusemorerecentdata.Ifthingschangelessdramatically,alargerKvaluewillallowthemodeltoconsideralargerwindowofdata1.
TCON=αmodel∗TDATA
(2)
whereαmodelistheModelConstructionCoefficientandTDATAistheDataCollectionInterval.Modelsthatcanbebuiltquicklywillhavealowerαmodelvalueindicatingthattheycouldbeappliedinenvironmentsdemandingmorefrequentmodelrecycling.TDATAishowoftenadatapointisreportedanddependentonthemonitoringinfrastructure.K∗αmodelgivesthenumberofdatapointsavailableforinferringthemodel.
3Knowledge-enhancedBayesiannetworkmodel
ThissectionpresentsaBayesiannetworkmodelforresponsetimesinservice-orientedsystems,whichisconstructedusingacombinationofreadilyavailabledomainknowledgeandperformancedatacollectedviainstrumentation(i.e.puttingmonitoringpointsintothesystemasisdescribedintheprevioussection).
3.1Mathematicalframework
ABayesiannetworkisusedtomodeltherelationshipbetweenelapsedtimeonservicesandend-to-endresponsetime,capitalizingonitssupportofautomaticmodelinferencefromdata.Thisformalismisalsochosenbecauseitreflectsthestochasticnatureoftheproblemwhilegraphicallyrepresentingthecausalityamongelapsedtimeonservicesandtheireffectsonend-to-endresponsetime.
DEFINITION1.AKnowledgeEnhancedResponseTimeBayesianNetwork(KERT-BN),isadirectedgraph(DAG)representingthejointdistributionP(D,X1,···,Xn)oftherandomvariablesXi,i=1···,n,theelapsedtimeofservicei,andrandomvariableD,theend-to-endresponsetime,suchthat
n
P(D,X1···Xn)=PD(D|Φ(D))PXi(Xi|Φ(Xi)).(3)
i=1
Here,Φ(Xi)andΦ(D)arethesetofparentsofnodeXi,i=1,...,nandD,respectively,
andwillbedeterminedbyworkflowandresourcesharingknowledgeinthenextsubsection.PXi(Xi|Φ(Xi)),i=1,2,...,nandPD(D|Φ(D))aretheConditionalProbabilityDistributions(CPDs)ofXiandD,respectively,anddescribethedependenciesamongserviceelapsedtimesandtheireffectontheend-to-endresponsetime(tobediscussedfurtherinSection3.3).
AKERT-BNcanbeeithercontinuousordiscrete,dependingontheapplicationneeds.SincecontinuousCPDs(typicallyGaussianCPDs)haverelativelyfewparameters,acontinuousKERT-BNallowsthemodeltoconvergemorequicklywhendataislacking.Thispropertyisattractiveinfastchangingenvironmentswheremodelreconstructionmustbeperformedveryfrequentlybeforealotofdatacanbegathered.Iftheenvironmentsarerelativelystable,adiscreteKERT-BNmaybeconsidered.Contrarytoacontinuousmodel,adiscreteoneimposesnoassumptionontheshapeoftheCPDsandislikelytoachievegreateraccuracywhenthereissufficientdata.3.2Establishingthestructurewithworkflowandresourcesharing
ABayesiannetcanbebuiltintwosteps.Firstdeterminethestructure(i.e.theDAGwhichcapturesthestatistical/causaldependenciesbetweentherandomvariables)andthenobtainthepa-rameters(CPDs).WhilstautomatedmethodsforBayesiannetstructurelearningexist(i.e.methodsforfindingaDAGtopologythatbestfitsthedata[13]),theyaretypicallycomputationallyexpen-sive.Forinstance,aKERT-BNforasystemofnserviceswillhaven+1nodesandexponential(inn+1)numberofDAGstructures.SuchcomplexitymakesitintractabletoexhaustivelysearchforthebestDAG[4]inlargeenvironments.EvengreedyalgorithmslikeK2[5]needstoexploreO((n+1)2)possibilities.
Intuitively,dependencyispresentbetweentworandomvariablesifthereissomeconnectionbetweenthetwosuchthatchangeinonevariablecanbesomehowcommunicatedtotheothervariablewhereacorrespondingvariationmaybetriggered.Inthecontextofservice-orientedsystems,itappearsreasonabletoassumethatsuchconnectionsbetweentheperformance(elapsedtimeinthispaper)oftwoservicesiandjcanexistintwoforms:
•Whileanumberofrelationshipsmaybeinterpretedfromworkflowinformation,thispaperattemptstoreducethenumberofdependenciesconsideredbyonlyaccountingfordirectandimportantrelationshipsbetweenservicesandtheirimmediateupstreamservices.Ifoneservice(say,i)istheimmediateupstreamserviceoftheotherservice(j)intheworkflowgraph,classicperformancemodelingliterature[15]suggestsi’selapsedtimecanbetiedtoitsthroughput.Thuselapsedtimeofjisrelatedtoithroughj’sinputfromi(assumingthatj’sinputcandetermineitselapsedtime).Aburstini’sworkloadoraslowdowninitsserviceratemayaffectitselapsedtime,andisalsoreflectedbychangeinj’selapsedtime.•Thetwoservicesaresharingacommonresource(e.g.CPU,memory,network).Statusofthe
commonresourcecanbetiedtotheperformanceofbothservices,resultingindependencybetweenthem.
Boththeworkflowandresourcesharinginformationcanbecapturedbyexistingmonitoringin-frastructures[20]orarewelldocumentedatsystemdesignstage.ThisknowledgecanthenusedtodeterminethedependenciesamongKERT-BNelapsedtimenodesX1,...,Xnatlittlecost.Inthecaseofserviceiaffectingdirectlyservicejdownstream,aBayesiannetworkrepresentingtheworkflowbetweenthesetwoservicesmustcontainanedgebetweentherandomvariablesXiandXj.Note,thatwhilstserviceiandjmayverywellaffectservicek,say,furtherdownstreamtoj,ourfocushereisonthesimplestDAGrepresentingtheworkflow,i.e.theBayesiannetstructureshouldbefreeofloops.Similarly,resourcesharingmayberepresentedbyservicesformingtheparentstoaKERT-BNnodeembodyingtheresourcetheyshare.
Giventhattheelapsedtimeoneachserviceiscollectivelyformingtheend-to-endresponsetime,itisnaturaltoassumethatresponsetimenodeDisdependentuponallelapsedtimenodesX1,...,Xn.Thus,KERT-BNcapturesthisdependencybyaconditionaldependenceofDontheentiresetX,i.e.PD(D|Φ(D))≡PD(D|X).
ThefullDAGstructureoftheKERT-BNfortheeDiaMoNDscenarioinFigure1isdepictedinFigure2,wheretheimagedaiservices)onthetwositesareinvokedinparallelaftertheimagelistservicesarecalled.
X1X2X3X4X5X6VariableServiceX1image_listX2X3X4X5work_listimage_locator_localogsa_dai_localimage_locator_remoteogsa_dai_remoteDX6Figure2.KERT-BNfortheeDiaMoNDscenario.
3.3Determiningparameterswithworkflow
HavingidentifiedanappropriateDAGfortheKERT-BNandtheparentsetsforeachnode,the
CPDsinEquation3cannowbedetermined,typicallyfromthedata[13].Potentially,parameterlearningcanbequiteexpensivewhennodeshavemany(discrete)parents(e.g.DinFigure2)or(discrete)parentshavehighcardinality2.
Specialmodels(e.g.NaiveBayesiannetworkandTAN)havebeenproposedtoreducethecomplexityofparameterlearning,byfocusingonlyonimportantparent-childrendependenciesthatmainlyaffectmodelaccuracy.Onegoalofthisworkistominimizethecostofparameterlearningwithoutthisaccuracycompromisebymakinguseofdomainknowledge.SpecifictoaKERT-BN,theneedforlearningPD(D|X1,...,Xn)canbelargelyeliminatedbyexploitingpreciseworkflowinformationinthefollowingmanner:
PD(D=f(X)|X)=1−lPD(D=f(X)|X)=l
(4)
wherefisadeterministicfunction(seetheexampleinthenextparagraph)thatlinkselapsedtimestoresponsetimeusingworkflowinformation3;listheprobabilityofa“leak”situation[17]whereresponsetimeDisnotdeterministicallygivenbytheworkflow-derivedfunctionf(X)duetonoises(e.g.errorsmaybeintroducedintothepredictionofDusingf(X),becausetheplacementofmonitoringpointsthroughcodeinstrumentationmayberestrictedandsubsequentmeasurementstakenonXarenotprecise).
ForFigure2,thedeterministicfunctiongivingtheCPDis:D=X1+X2+max(X3+X5,X4+X6).Themaximumoperationistheresultofparallelinvocationoftheimage
daiservices)onthetwosites,whereasthesumoperationsisaproductof
sequentialserviceinvocations.
3.4Decentralizedparameterlearning
TheKERT-BNCPDs,PXi(Xi|Φ(Xi)),i=1,2,...,ncanbeacquiredbycollectingthenum-berofdatainstancesXi=xi∧Φ(Xi)=Φ(xi),wherexiandΦ(xi)areinstancesofelapsedtimesofXiandΦ(Xi),andusinganymaximumlikelihood(i.e.normalizedcounts)orBayesianmethod[13].
NoticethatthecomputationonlyrequiresdataaboutXiandallofitsKERT-BNparentsΦ(Xi).ThisdatalocalitysuggeststhatthecomputationofeachPXi(Xi|Φ(Xi))canactuallybeperformedonservicei,NTothisend,communicationcanbesetupbetweenthemonitoringagent(seeSubsection2)forserviceiandthoseagentsforservicescorrespondingtoΦ(Xi).ForserviceswhereΦ(Xi)=∅,nocommunicationisneeded.Suchcommunicationtakesplaceperiodicallyatafrequencythatwillnotfloodthenetwork.Thecommunicateddataarebatchedtogetherwithlocallycollecteddataoneachservicei(serviceswhereΦ(Xi)=∅batchlocaldataonly).TheparametercomputationisinitiatedandtheresultsreportedeveryTCON.
ItcanbenotedfromFigure1thatuserrequestsaresentfromimmediateupstreamservices
local)toadownstreamservice(e.g.ogsalocal).These(e.g.image
communicationscanbeleveragedtosendelapsedtimedatafromparentsΦ(Xi)toXi,byattachingthedatainanextraSOAPsegmentattheendoftheapplicationrequestmessages.ThecomputationofPXi(Xi|Φ(Xi))canthenbeconductedonserviceiwhenneeded.Thisideawillbepursuedfurtherinthefuture.
Decentralizingtheparameterlearningcomputationshasseveralattractions:•The(potentiallylarge)computationalexpensedoesnotfallonacentralmanagementnodewhichmaybecomeabottleneckasthesystemscales.EventhoughtheentireKERT-BNstructureisstillmaintainedinthecentralserver,itwillprovefarmorelightweightthanstoringandcomputingallCPDsandshouldnotcompromisescalabilitytoalargeextent.•Thecomputationcanbeperformedconcurrentlyreducingoveralltimetaken.
•Theelapsedtimedatamaybeembeddedinapplicationrequestsminimizingcommunicationoverhead.
4Evaluationthroughsimulations
Thissectionevaluatesourapproachthroughsimulationsunderacomprehensivesetofsettingsthataredifficulttoreplicateinreal-worldsystems.First,theperformanceofKERT-BNiscom-paredagainstNaiveResponseTimeBayesianNetwork(NRT-BN)(i.e.learnedpurelyfromdataviabothstructurelearningwithK2[5]andparameterlearning)inadecentralizedprocess.Thentheextrabenefitoflearningunknown(fromdomainknowledge)KERT-BNparametersinadecentral-izedmannerisassessed.
4.1Evaluationsettingsandmetrics
Experimentsinthissectionwereconductedwithinaservice-orientedsystemsimulatedinMatlab[11].Thesimulatedservicesreceiveandsendcallsamongotherandrandomlygenerateaprocessingdelayuponreceivingcalls.Theyareassembledtogetherbydifferentworkflowstoconstitutesimulatedapplications.Thesimulateddelays(andresponsetimes)areusedtoformtrainingandtestingdatasets.ContinuousKERT-BNandNRT-BNmodels(wherelinEquation4isassumedtobe0)withGaussianCPDswereconstructedusingtrainingdataandthenrunagainstthetestingdata,soastogaugetheirperformanceintermsofthefollowingtwometrics:
•Constructiontime-thetimeittakestobuildtheentireBayesiannetwork(i.e.includingthestructureandparametervalues)
•Data-fittingaccuracy-theloglikelihoodoftestingdatagiventheBayesiannetworkmodel,log10p(Data|BN)[13].Thehigherthelikelihood,thebetterthemodelfits(i.e.accuratelyrepresents)thetestingdata.
TheexperimentswereimplementedusingtheMatlabBayesiannetworktool-box[12]andcon-ductedonaRedhatLinuxmachinewithtwo3.0GHZdualcoreCPUsand1GBmemory.Inallexperimentruns,theenvironmentalmetric(seeSection2)wassettoK=3toemulatemodeltraininginenvironmentswheredatamorethan2constructionintervals(seeSection2)oldweredeemeduncorrelatedwithpresentenvironmentstatus.ThedatacollectionintervalhadthevalueofTDATA=10seconds,ahighfrequencyatwhichthereporteddatawillnotfloodthenetworkinaheavilyloadedenvironment.Alldata-fittingaccuracynumbersaremeasuredagainstatrainingsetof100datapoints4.2KERT-BNvs.NRT-BN
Inthefirstexperiment,boththeconstructiontimeandthedata-fittingaccuracyofKERT-BNandNRT-BNbuiltfor30simulatedservicesarecompared.Thetrainingsetscontainedfrom36datapoints(i.e.K∗αModel=3∗12=36andmodelconstructionintervalTCON=αModel∗TDATA=2minutes)to1080datapoints(withK∗αModel=3∗360=1080andmodelconstructionintervalTCON=60minutes).Foreachtrainingsetsizeconsidered,theexperimentwasrepeated10times(eachwithfreshtrainingandtestingdata)toobtainaveragemodelperformance.
TheresultofthisexperimentisillustratedinFigure3.Asonewouldexpect,theconstructiontimeofbothNRT-BNandKERT-BNgrowslinearlywiththetrainingsetsize.KERT-BNcon-sistentlyoutperformsNRT-BN,andtheadvantagebecomesincreasinglypatent,asthetrainingset
grows.ThisobservationcanbeexplainedbythefactthatKERT-BNonlyrequirespartialparame-terlearningwiththetrainingdata,whereasNRT-BNmustgothroughbothanexpensivestructurallearningphaseandafullparameterlearningphase.
Despiteskippingstructurallearningcompletelyandparameterlearningpartly,KERT-BNstillachievesaslightlysuperioraccuracyasisillustratedintherighthalfofFigure3.Inaddition,theNRT-BNappearstobequitesensitivetotrainingdatasizeandrequirearelativelylargenumberofdatapoints(about600inFigure3withαModel=180andmodelconstructionintervalTCON=1800seconds=30minutes)toreachastableaccuracylevel.Toavoiddataaccuracyloss,shortmodelconstructionintervalsmightnotbeusedwithNRT-BNwhennecessary,ifdatacollectionisnotsufficientlyfrequent.Incontrary,KERT-BNisnotliabletosuchrisks,asitsaccuracyconvergesmuchmorequickly.
200180KERT−BNNRT−BN−5400−5300Construction time (x seconds)1601401201008060402000200Data−fitting accuracy−5500−5600−5700−5800−5900−6000KERT−BNNRT−BN020040060080010001200Training data set size40060080010001200Training data set sizeFigure3.KERT-BNperformancevs.NRT-BNperformancewithdifferenttrainingsetsizes.
Inthesecondexperiment,theperformanceofKERT-BNandNRT-BNwhenappliedtoupto100simulatedservices(embodyingsmalltolargedatacenters)isplotted.Themodelsaretrainedwithtrainingsetsofsize36datapoints(againαModel=12andTCON=2minutes)tostudyhowpracticalitistolearnthemodelsinfastchangingenvironmentsofdifferentsizes.Thisexperimentwasalsorepeated10times.
Figure4illustratestheoutcomeofthesecondexperiment.Anundesirednon-linearcorrelation(discussedinSubsection3.2)betweenNRT-BNconstructiontimeandthenumberofnodesinthemodelcanclearlybeseeninthelefthalfofthefigure.Thiscorrelationdedeteriorateevenfasterinbiggerenvironmentsnotillustratedinthefigure,takingover2hoursfor200services,over10hoursfor300,andmorethan2daysfor500services.Consequently,astheenvironmentgrowsinsize,NRT-BNmaysimplybeimpossibletobuildatshortmodelconstructionintervals,itslearningtimeeclipsinghowoftenitissupposedtoberenewed.Inthisexperimentforexample,NRT-BNbeingconstructedatthedesignatedintervalofTCON=2minuteswillnotbefeasibleforanyenvironmentwithmorethan60services.Incontrast,theconstructiontimeofKERT-BNremainsflatacrossdifferentnumberofnodes,asonlyrelativelycheapparameterlearningamongservicenodesisrequired.TherighthalfofthefigureoncemoreconfirmsthatKERT-BNachievesgreateraccuracythanNRT-BNundervarioussettings(numberofservicesinthisexperiment).
700KERT−BN600NRT−BN0−0.2−0.4x 104
KERT−BNNRT−BNConstruction time (x seconds)Data−fitting accuracy020406080100500−0.6−0.8−1−1.2−1.4−1.6400300200100−1.8−20204060801000Number of BN nodesNumber of BN nodes
Figure4.KERT-BNperformancevs.NRT-BNperformancewithdifferentenvironmentsizes(i.e.servicenumbers).
4.3Decentralizedparameterlearningvs.centralizedparameterlearning
HavingshownthatKERT-BNoutperformsNRT-BNinvariouscentralizedlearningsituations,thisadvantageisfurtherextendedbylearningtheunknownKERT-BNparametersinadecentral-izedmanner,asisproposedinSubsection3.4.Tothisend,thetimetakentolearneachCPDinKERT-BNsismeasured.SincetheseCPDswillbecomputedinparallelonmonitoringagentsinpractice,thedecentralizedlearningtimeisthemaximumofindividuallearningtimesacrossallCPDs,andiscomparedagainstthetimeforregularcentralizedKERT-BNparameterlearninginFigure5.TheaccuracyofthesetwoKERT-BNparameterlearningmethodsisnotplottedonthegroundsthattheyproduceprincipallythesameparameters.
ForeachKERT-BNsize,theparametersof20randomlygeneratedKERT-BNsarelearned,withtheaveragedecentralizedlearningtimeandaveragecentralizedlearningtimeplottedintheFigure.Itcanbeseenthatthedecentralizedlearningtimeisconstantlysuperiortothecentralizedlearningtime.Thefigurealsorevealsanobvioustrendofthissuperioritybecomingmoreandmoreconsiderableasthenumberofservices(thusthenumberofKERT-BNCPDs)increase.
5Applications
ThissectiondemonstratestheuseofKERT-BNintacklingreal-worldperformancemanage-mentproblemsontheeDiaMoNDtest-bed,andfurtherjustifiestheapproachpresentedinthispa-local,imagelocal,per.TheeDiaMoNDservicesshowninFigure1,ogsa
ogsaremote,andimageremotewerehostedbyfourAIXmachineswithtwo3.0GHZdualcoreCPUsand2GBmemory.Theimagelistserviceswererunontwoseparate3.0GHZdualcoreCPUsofaRedhatLinuxserverwith1GBmemory.Sincetheentiretest-bediswithinthesamesub-net,extraroutingwasimposedbetweenimage
locator
4.5KERT−BN parameter learning time (x seconds)Decentralized learning43.532.521.510.50Centralized learning0510KERT−BN sizes
1520253035Figure5.Decentralizedlearningtimevs.centralizedlearningtimeforvariousenvironmentsizes(i.e.servicenumbers).
GiventhateDiaMoNDfostersarelativelydedicatedandstablesystemwhereTDATA=20seconds,abiggercorrelationmetricK=10andalongerconstructionintervalTCON=20minutes(αModel=120)werechosen.Discretemodelsarebuiltfortworeasons:1)therearecompara-tivelymanydatapointsandtheymaynotfitGaussiandistributionsinpractice;2)MATLABBNTdoesnotsupportnon-lineardeterministicCPDsthatcontainmaximumrelationships.5.1dComp:Compensatingformissingdata
InlargedistributedsystemsliketheGrid,performancedatafromsomesystemcomponentsmaygomissingdueto1)lackofinstrumentation,2)failureintheactofdatareportingor3)theneedtoreducemonitoringoverhead.Tocopewiththeconsequences,mechanismsmustbeprovidedtosupportestimationofperformancestatesofunobservablecomponentsusingdatafromtheobservableones.
Typically,theonlyknowledgeavailableaboutelapsedtimeonunobservablecomponents(whoselatestperformancedatagomissing)isfromhistoricalmeasurementsorcomponentproviders.Suchpriorknowledgeislikelytobeobsolete,impreciseornotavailable.dComp,atechniquethatap-pliesKERT-BNtoupdatethispriorknowledgewithcurrentelapsedtimemeasurementsfromob-servablecomponentsaswellasresponsetimemeasurements,isproposedtoaddressthisproblem.dCompcomputesaPosterior(Probability)Distributionp(Y|O=E(o))usingstandardBN-basedinference[13]basedonEquation3foreachYwherenoelapsedtimedatawasavailable(OistheobservablesetofservicesandE(o)isthecurrentmeasurementmean).Intheabsenceofusingfullblownfill-inmethods(likeExpectationMaximization),itsufficestojustusethesummaryofobservationstatistics(themean,E(o))toassessY.
ToillustratedComp,theKERT-BNinFigure2isconsidered.Noserviceisassumedtobeobservable.dCompisemployedtoinfertheposteriordistributionforX4usingobservationsontherestofthevariables.Figure6(a)comparestheinferredposteriordistributionofX4withthepriordistributionofX4,anddemonstrateshowtheposteriordistributionhasupdatedourknowledge
aboutX4.Inthiscase,theposteriordistributionhasshifted(fromthepriordistribution)towardtheactualelapsedtimeandbecomemoredeterministicwithanarrowershape.
10.90.80.7Prior distributionPosterior distributionObservation10.90.80.7Prior distributionPosterior distributionObservationProbability0.60.50.40.30.20.100102030405060Probability0.60.50.40.30.20.100102030405060708090Elapsed time (x0.01 second)Response time (x0.01 second)(a)dCompexample:posteriordistributionofX4vs.(b)pAccelexample:responsetimehadX4per-priordistributionofX4.formednormallyvs.observedresponsetime.
Figure6.ExamplesofdCompandpAccel.
5.2pAccel:Acceleratingkeyserviceperformance
Duetothecomplexityofservice-orientedenvironments,asignificantperformanceboostforaparticularservicemaynotleadtosystem-widebenefits.Forinstance,ifserviceAisbeinginvokedinparallelwithanotherserviceBthathasasignificantlylongerelapsedtime,reducingA’selapsedtimecandolittletoimprovetheoverallperformance.Amethodtoassesstheend-to-endimpactoflocalactionsisessentialtoavoidingspendingalotofeffortinacceleratingthoseservicesthatwillbringlittleresponsetimeimprovement.
pAccelisameanstoachievethisgoalthroughtheuseofKERT-BN.BasedonEquation3,pAccelcomputestheposteriorresponsetimedistributionp(D|Z=E(z))giventhemeanofapredictionabout,Ztheelapsedtimeofanyservice.Theobservationmeanissufficientforthesamereasonaswasarguedintheprevioussubsection.
Recalltheexampleintheprevioussubsection.AposteriorresponsetimedistributioncannowbecomputedwithX4reducedto90%(e.g.afterlocalresourceallocationactions).Figure6(b)showsthattheposteriorresponsetimeprovidesagoodapproximationoftheactualimprovedresponsetimemean.Thedifferencebetweentheposteriordistributionandthepriordistributioncanbeusedtogaugethebenefitofresourceactionsandguideautonomicdecisionmaking.5.3Justification
TheefficacyofKERT-BNduringmodelre-constructionsintheeDiaMoNDscenarioinFigure1isexaminedbycontrastingtheresponsetimedistributionprojectedusingpAccelwithKERT-BNandNRT-BNagainstrealresponsetimemeasurements.
Consideringthatbothhumanusersandautonomicsoftwarearelikelytobeinterestedinas-sessmentslike“Whatistheprobabilitythatresponsetimewillexceedthethreshold(s)?”,wedraw
thecomparisonusingRelativeThresholdViolationProbabilityErrordefinedasfollows:
=
|Pbn(D>h)−Preal(D>h)|
fromsomeblack-boxcomponents.2)modelconstructioncanbepronetohumanerrors;3)modelsbuiltareaprioriandmaynotadaptwelltoworkloadorsystemchanges;4)assumptionsthatdeviatesignificantlyfromrealsystemconditions(e.g.thesystembeinginstablestate[19])mayhavetobemade.
Thesecondschoolofworkscontainemergingresearchrootedinstatisticallearningtechniquesusingperformancedatacollectedthroughinstrumentation.Forexample,tree-augmentednaiveBayesiannetworksarelearnedfromdatatocorrelatesystemcomponentresourcestate(CPUusage,memoryusageetc.)withservicelevelagreementviolation/compliancestates[8].Unlikethoseinthefirstschool,statisticallylearnedmodelsmostlyassumeverylittleornodomainknowledgeandcanbeprogrammaticallyupdated/reconstructedinresponsetosystemchanges.Notonlyarethesemodelsoftenlearnedentirelyfromscratchatpotentiallysubstantialcomputationalcost,buttheycanberathersensitivetonoisyormissingdata.
Theapproachdescribedinthispaperovercomestheseweaknesses,byincorporatingdomainknowledgeinthestatisticallearningframeworkanddecentralizingthelearningprocedure.Rishet.al[16]hasalsoadoptedaknowledge-enhancedstrategytoconstructafaultdeterminationmodel.However,applyingRish’sstrategyherewouldbedifficultgivenitsdependencyonbinarystatevariables,binarydomainrelationshipsandstrongindependenceassumptions.
7ConclusionsandFutureWork
Thispaperpresentsanautomatedandcost-effectivesolutiontoend-to-endperformancemod-elingforservice-orientedsystems,anessentialissuetowardsprovidingautonomiccapabilitiesintheseenvironments.Weleverageacombinationofeasilyattainabledomainknowledge(mainlytheworkflowinthispaper)andperformancedatacollectedsystem-wide,toinduceaknowledge-enhancedresponsetimeBayesiannetwork(KERT-BN)thataccuratelycorrelatessystemcompo-nentperformancetoend-to-endQoSgoals.Formodelelementsthatcannotbedeterminedinthisway,adecentralizedlearningmechanismisputinplacetoreducelearningoverheadincurred.Experimentsinbothsimulationandreal-worldcontextsshowthattheutilizationofdomainknowl-edge1)largelyreducesthemodelbuildingcostwithoutsacrificingmodelaccuracy;and2)rapidlyguidesthemodeltrainingprocesstowardsanaccurateproductwithfewdata.Thesefeaturesareparticularlyvaluableinhighlydynamicandcomplexenvironmentswherethecurrentmodelquicklyexpiresandmustberebuilt.
Theapproachisautomatedthroughemployinganinstrumentation-basedtechniquetotheau-tomaticextractionofworkflowandresourcesharinginformation[20],programmaticallyfeedingthisknowledgetogetherwithperformancedataintotheBayesiannetworkframework,andsub-sequentlyinducingthemodelunderaperiodicalscheme.Theautomatedmodelconstructioncanservetofurtherminimizetheneedforhumanengagementsinmodel-drivenproceduresandsteerthemclosertocompleteautonomicsolutions.
TheKERT-BNapproachcanbeeffortlesslygeneralizedandappliedtoanyinstrumenteddis-tributedsystemfeaturingusertransactionsthattraversesystemcomponents.Withminoradjust-ments,itmayalsobeusedtomodeltherelationshipbetweencomponent-levelmetricsotherthanelapsedtime(e.g.CPUormemoryusage)andend-to-endperformancegoals.Theseissuesarepromisingtopicsoffuturework.
Anotherimportantextensionofourworkisemployingdomainknowledgeanddecentraliza-tiontechniquestoreducethecostofprobabilityassessmentusingaconstructedmodel.Crucial
autonomicroutinessuchasresourceprovisioningandproblemlocalizationwillprofitgreatlyonrapidresponsetimeassessment.Addingthisvaluetotheworkpresentedinthispaperwillleadtoaccurateperformancemodelsthatareinexpensivetobuildanduse,andgoalongwaytofacilitateefficient,scalableautonomicsystemmanagement.
References
[1]P.Barham,A.Donnelly,R.Isaacs,andR.Mortier.Usingmagpieforrequestextractionandworkload
modelling.InProceedingsofthe6thSymposiumonOperatingSystemsDesignandImplementation(OSDI’04),,December2004.
[2]J.M.Brady,D.J.Gavaghan,A.C.Simpson,M.Mulet-Parada,andR.P.Highnam.eDiaMoND:A
grid-enabledfederateddatabaseofannotatedmammograms.InF.Berman,G.C.Fox,andA.J.G.Hey,editors,GridComputing:MakingtheGlobalInfrastructureaReality,pages923–943.WileySeries,2003.
[3]J.Cardoso,J.Miller,A.Sheth,andJ.Arnold.Modelingqualityofserviceforworkflowsandweb
serviceprocesses.Technicalreport,LSDISLab,ComputerScience,UniversityofGeorgia,May2002.[4]G.F.C.E.ClarkGlymour(Editor).Computation,CausationandDiscovery.MITPress,1999.
[5]G.CooperandE.Herskovits.Abayesianmethodfortheinductionofprobabilisticnetworksfromdata.
MachineLearning,pages309–347,1992.
[6]N.J.Dingle,P.G.Harrison,andW.J.Knottenbelt.Responsetimedensitiesingeneralisedstochastic
petrinetmodels.InWOSP’02:Proceedingsofthe3rdinternationalworkshoponSoftwareandperformance,pages46–54,NewYork,NY,USA,2002.ACMPress.
[7]N.FriedmanandM.Goldszmidt.SequentialupdateofBayesiannetworkstructure.In13thConf.on
UncertaintyinArtificialIntelligence,pages165–174,1997.
[8]C.Ira,G.Moises,K.Terence,S.Julie,andC.Jeffrey.Correlatinginstrumentationdatatosystem
states:Abuildingblockforautomateddiagnosisandcontrol.Technicalreport,HPLabs,November2004.
[9]J.O.KephartandD.M.Chess.Thevisionofautonomiccomputing.Computer,36(1):41–50,2003.[10]Y.Lu,T.Abdelzaher,C.Lu,L.Sha,andX.Liu.Feedbackcontrolwithqueueing-theoreticprediction
forrelativedelayguaranteesinwebservers.InRTAS’03:ProceedingsoftheThe9thIEEEReal-TimeandEmbeddedTechnologyandApplicationsSymposium,page208,Washington,DC,USA,2003.IEEEComputerSociety.
[11]C.Moler.NumericalComputingwithMATLAB.SocietyforIndustrialandAppliedMathematics,
2004.
[12]K.Murphy.Bayesnettoolboxformatlab.ComputerScienceandStatistics,2001.[13]R.Neapolitan.ProbabilisticReasoninginExpertSystems.WileyInterscience,1989.
[14]M.P.PapazoglouandD.Georgakopoulos.Introduction.Commun.ACM,46(10):24–28,2003.
[15]N.M.P.PeterG.Harrison.Performancemodellingofcommunicationnetworksandcomputerarchi-tectures.Wokingham,Addison-Wesley,1992.
[16]I.Rish,M.Brodie,andS.Ma.Accuracyvs.efficiencytrade-offsinprobabilisticdiagnosis.InEigh-teenthnationalconferenceonArtificialintelligence,pages560–566.AmericanAssociationforArtifi-cialIntelligence,2002.
[17]S.RussellandP.Norvig.ArtificialIntelligence:AMordenApproach.PrenticeHall,2003.
[18]D.J.SpiegelhalterandS.L.Lauritzen.Sequentialupdatingofconditionalprobabilitiesondirected
graphicalstructures.Networks,pages579–605,1990.
[19]B.Urgaonkar,G.Pacifici,P.Shenoy,M.Spreitzer,andA.Tantawi.Ananalyticalmodelformulti-tier
internetservicesanditsapplications.SIGMETRICSPerform.Eval.Rev.,33(1):291–302,2005.
[20]R.Zhang,S.Heisig,S.Moyle,andS.McKeever.Ogsa-basedgridworkloadmonitoring.InPro-ceedingsofthe5thIEEEInternationalSymposiumonClusterandGridComputing(CCGrid05).IEEEComputerSocietyPress,April2006.
因篇幅问题不能全部显示,请点此查看更多更全内容