
Hadoop Integration Benchmark

Product Profile and Evaluation: Talend and Informatica By William McKnight and Jake Dolezal October 2015

Sponsored by


MCG Global Services Hadoop Integration Benchmark

© MCG Global Services 2015 www.mcknightcg.com

Executive Summary

Hadoop has become increasingly prevalent over the past decade on the information management landscape by empowering digital strategies with large-scale possibilities. While Hadoop helps data-driven organizations (or those who desire to be) overcome obstacles of storage and processing scale and cost in dealing with ever-increasing volumes of data, it does not, in and of itself, solve the challenge of high-performance data integration, an activity that consumes an estimated 80% or more of the time in getting actionable insights from big data.

Modern data integration tools were built in a world abounding with structured data, relational databases, and data warehouses. The big data and Hadoop paradigm shift has changed and disrupted some of the ways we derive business value from data. Unfortunately, the data integration tool landscape has lagged behind in this shift. Early adopters of big data for their enterprise architecture have only recently found some variety and choice in data integration tools and capabilities to accompany their increased data storage capabilities.

Hadoop has evolved greatly since its early days of merely batch-processing large volumes of data in an affordable and scalable way. Numerous approaches and tools have arisen to elevate Hadoop's capabilities and potential use cases. One of the more exciting new tools is Spark, a multi-stage, in-memory processing engine that offers high performance and much faster speeds than its disk-based predecessors.

Even while reaching out to grasp all these exciting capabilities, companies still have their feet firmly planted in the old paradigm of relational, structured, OLTP systems that run their day-in, day-out business. That world is, and will be, around for a long time. The key, then, is to marry capabilities and bring these two worlds together. Data integration is that key: bringing together the transactional and master data from traditional SQL-based relational databases and the big data from a vast array and variety of sources. Many data integration vendors have recognized this and have stepped up to the plate by introducing big data and Hadoop capabilities to their toolsets. The idea is to give data integration specialists the ability to harness these tools just like they would the traditional sources and transformations they are used to. With many vendors throwing their hat into the big data arena, it will be increasingly challenging to identify and select the right or best tool. The key differentiators to watch will be the depth to which a tool leverages Hadoop and the performance of integration jobs. As the volume of data to be integrated expands, so too will the processing times of integration jobs. This could spell the difference between a "just-in-time" answer to a business question and a "too-little-too-late" result.


The benchmark presented here focuses on these two differentiators. The vendors chosen for this benchmark are Talend and Informatica. The benchmark was designed to simulate a set of basic scenarios to answer some fundamental business questions that an organization from nearly any industry sector might encounter and ask, as opposed to some niche scenario or an "academic exercise" use case. In doing so, we hoped to identify not only performance, but utility as well. The results of the benchmark show Spark to be the real winner, evidenced by the nearly 8x performance advantage in the largest-scale scenario. Based on the trends, we also anticipate the gap will widen even further with larger data sets and larger cluster environments. In addition, the benchmark reveals the advantage of leveraging Spark directly through the integration tool, as opposed to attempting to do so through another medium (Hive, in this case), which proved to be futile due to lack of support by even enterprise distributions of Hadoop.


About the Authors

William is President of McKnight Consulting Group Global Services (www.mcknightcg.com). He is an internationally recognized authority in information management. His consulting work has included many of the Global 2000 and numerous midmarket companies. His teams have won several best practice competitions for their implementations, and many of his clients have gone public with their success stories. His strategies form the information management plan for leading companies in various industries. William is the author of the book "Information Management: Strategies for Gaining a Competitive Advantage with Data". William is a very popular speaker worldwide and a prolific writer with hundreds of articles and white papers published. William is a distinguished entrepreneur, and a former Fortune 50 technology executive and software engineer. He provides clients with strategies, architectures, platform and tool selection, and complete programs to manage information.

Jake Dolezal has over 16 years of experience in the Information Management field, with expertise in business intelligence, analytics, data warehousing, statistics, data modeling and integration, data visualization, master data management, and data quality. Jake has experience across a broad array of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and gaming. Jake earned his Doctorate in Information Management from Syracuse University, consistently the #1-ranked graduate school for information systems by U.S. News and World Report. He is also a Certified Business Intelligence Professional through TDWI with an emphasis in Data Analysis. In addition, he is a certified leadership coach and has helped clients accelerate their careers and earn several executive promotions. Jake is Practice Lead at McKnight Consulting Group Global Services.


About MCG Global Services

With a client list that is the "A list" of complex, politically sustainable, and successful information management, MCG has broad information management market touchpoints. Our advice is an infusion of the latest best practices culled from recent, personal experience. It is practical, not theoretical. We anticipate our customers' needs well into the future with our full lifecycle approach. Our focused, experienced teams generate efficient, economic, timely, and politically sustainable results for our clients.

• We take a keen focus on business justification.
• We take a program, not a project, approach.
• We believe in a model of blending with client staff, and we take a focus on knowledge transfer.
• We engineer client workforces and processes to carry forward.
• We're vendor neutral, so you can rest assured that our advice is completely client oriented.
• We know, define, judge, and promote best practices.
• We have encountered and overcome most conceivable information management challenges.
• We ensure business results are delivered early and often.

MCG services span strategy, implementation, and training for turning information into the asset it needs to be for your organization. We strategize, design, and deploy in the disciplines of Master Data Management, Big Data, Data Warehousing, Analytic Databases, and Business Intelligence.

Hadoop and Data Integration

This benchmark is part of research into the performance of loads on Hadoop clusters, an increasingly important platform for storing the data powering digital strategies. Hadoop usage has evolved since the early days when the technology was invented to make batch-processing big data affordable and scalable. Today, with a lively community of open-source contributors and vendors innovating a plethora of tools that natively support Hadoop components, usage and data are expanding. Data Lakes embody the idea of all data stored natively in Hadoop.

Traditionally, data preparation has consumed an estimated 80% of analytic development efforts. One of the most common uses of Hadoop is to drive this analytic overhead down. Data preparation can be accomplished through a traditional ETL process: extracting data from sources, transforming it (cleansing, normalizing, integrating) to meet the requirements of the data warehouse or downstream


repositories and apps, and loading it into those destinations. As in the relational database world, many organizations prefer ELT processes, where higher performance is achieved by performing transformations after loading. Instead of burdening the data warehouse with this processing, however, you do the transforms in Hadoop. This yields high-performance, fault-tolerant, elastic processing without detracting from query speeds.

In Hadoop environments, you also need massive processing power because transformations often involve integrating very different types of data from a multitude of sources. Your analyses might encompass data from ERP and CRM systems, in-memory analytic environments, and internal and external apps via APIs. You might want to blend and distill data from customer master files with clickstream data stored in clouds and social media data from your own NoSQL databases or accessed from third-party aggregation services. You might want to examine vast quantities of historical transactional data along with data streaming in real time from consumer transactions or machines, sensors, and chips.

Hadoop can encompass all of this structured, unstructured, semi-structured, and multi-structured data because it allows schema-less storage and processing. When bringing data into Hadoop, there is no need to declare a data model or make associations with target applications. Instead, loose binding is used to apply or discover a data schema after transformation, when the data is written to the target production application or specialized downstream analytic repository. This separation of data integration processes from the run-time engine makes it possible to:

• Share and reuse transforms across data integration projects
• View data in different ways and expose new data elements by applying different schema
• Enhance data sets by adding new dimensions as new information becomes available or additional questions emerge

There are innumerable sources of big data, and more are springing up all the time. Since being limited to the low-performing open-source Sqoop, Flume, command-line HDFS, and Hive in the early days, numerous approaches and tools have arisen to meet the Hadoop data integration challenge. While much has been studied on querying the data in Hadoop, the same cannot be said for getting the data into Hadoop clusters. Many enterprises are bringing the features and functions they are used to in the relational world into the Hadoop world, as well as the non-functional requirements, developer productivity, metadata management, documentation, and business rules management you get from a modern data integration tool. The vendors benchmarked for this report are Talend and Informatica.
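The loose-binding idea described above can be sketched in a few lines: the raw log line is stored as-is, and a schema is applied only when the data is read. A minimal illustration in Python (the regular expression and field names here are our own, not taken from the benchmark jobs):

```python
import re

# Schema-on-read: the raw line is kept untouched in storage; this pattern
# imposes structure only at read time. (Illustrative pattern, not from the
# benchmark's actual integration jobs.)
APACHE_LOG = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\d+)'
)

def read_with_schema(raw_line):
    """Apply the schema at read time; lines that do not fit return None."""
    m = APACHE_LOG.match(raw_line)
    return m.groupdict() if m else None

record = read_with_schema(
    '249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] '
    '"GET /images/footer-basement.png HTTP/1.0" 200 2326'
)
```

A different "schema" (a different pattern over the same stored bytes) could expose other fields later, which is the flexibility the loose-binding approach buys.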


Talend Product Profile

Company Profile

Product Name: Talend Big Data Integration
Initial Launch: 2012
Current Release and Date: Version 6.0, shipped September 2015
Key Features: Native support for Apache Spark and Spark Streaming leveraging over 100 Spark components, one-click conversion of MapReduce jobs to Spark, support for Continuous Delivery, Master Data Management (MDM) REST API and Query Language support, Data Masking, and Semantic Analytics
Hadoop DI Competitors: Informatica, Pentaho, Syncsort
Company Founded: 2006
Focus: Highly scalable integration solutions addressing Big Data, Application Integration, Data Integration, Data Quality, MDM, BPM
Employees: 500
Headquarters: Redwood City, CA
Ownership: Private


Informatica Product Profile

Company Profile

Product Name: Informatica Big Data Edition
Initial Launch: 2012
Current Release and Date: Version 9.6.1, shipped 2015
Key Features: All functions on Hadoop, Universal Data Access, High-Speed Data Ingestion and Extraction, Real-Time Data Collection and Streaming, Data Discovery on Hadoop, Natural Language Processing on Hadoop, Attunity Replicate fully embedded; additional GUI improvements; more target platforms
Hadoop DI Competitors: Talend, Pentaho, Syncsort, Tervela
Founded: 1998
Focus: World's number one independent provider of data integration software.
Employees: 5,500+
Headquarters: Redwood City, CA
Ownership: Private (previously public)


Benchmark Overview

The intent of the benchmark's design was to simulate a set of basic scenarios to answer some fundamental business questions that an organization from nearly any industry sector might encounter and ask. The common business questions formulated for the benchmark are as follows:

• What impact do customers' views of pages and products on our website have on sales? How many page views occur before they make a purchase decision (whether online or in-store)?

• How do our coupon promotional campaigns impact our product sales or service utilization? Do customers who view or receive our coupon promotion come to our website and buy more or additional products they might not otherwise buy without the coupon?

• How much does our recommendation engine influence or drive product sales? Do customers tend to buy additional products based on these recommendations?

Our experience working with clients from many industry sectors over the past decade tells us these are not niche use cases, but common, everyday business questions shared by many companies. The benchmark was designed to demonstrate how a company might approach answering these business questions by bringing different sources of information into play. We also took the opportunity to show how Hadoop can be leveraged, because some of the data of interest in this analytical case is likely of a large volume and non-relational, or semi- to un-structured, in nature. In these cases, using Hadoop would likely be the best course of action we would recommend to a client seeking to answer these questions. The benchmark was also set up as a data integration benchmark, because it is highly probable that the required data resides in different sources. Some of these sources are also probably not being consumed and aggregated into an enterprise data warehouse, due to the high volume and difficulty of integrating voluminous amounts of semi-structured data into a traditional data warehouse. Thus, the benchmark was designed to mimic a common scenario and the challenges faced by organizations seeking to integrate data to address these and similar business questions.


Benchmark Setup

The benchmark was executed using the following setup, environment, standards, and configurations.

Server Environment

Figure 1: Server Environment and Setup

Hadoop Distribution: Cloudera CDH 5.4.7
EC2 Instance: Memory-optimized r3.xlarge (4 vCPUs, 30.5 GB memory, 200 GB storage)
OS: Ubuntu LTS 14.04 (Trusty Tahr)
Source Data Types: Text-based log files and a relational database
Data Volume: 20 GB (log files) and 12,000,000 rows (RDBMS)
TPC-H Scale Factor: 0.125x–2x
RDBMS: MySQL 5.6
Java Version: 1.6.0_87


The benchmark was set up using Amazon Web Services (AWS) EC2 instances deployed into an AWS Virtual Private Cloud (VPC) within the same Placement Group. According to Amazon, all instances launched within a Placement Group have low-latency, full-bisection, 10 Gigabit per second bandwidth between instances. The first EC2 instance was a memory-optimized r3.xlarge with four vCPUs and 30.5 GB of RAM. A storage volume of 200 GB was attached and mounted on the instance. This Linux instance was running Ubuntu LTS 14.04 (nicknamed Trusty Tahr). On this instance, we deployed Cloudera CDH 5.4.7 as our Hadoop distribution, along with MySQL 5.6 to serve as a source relational database.

The second EC2 instance was an r3.xlarge machine running Windows Server 2012. The Windows instance was installed with the Talend Studio and Informatica Big Data Edition (BDE) developer suites.

NOTE: In order to run Informatica BDE, a third EC2 instance was used to run the domain and repository services required for Informatica. It was a Red Hat Linux machine and was used only to serve as the Informatica repository service. It was not involved in the processing of data or execution of the benchmark.

The relational source for the benchmark (stored in MySQL 5.6 on the Linux instance) was constructed using the Transaction Processing Performance Council TPC Benchmark H (TPC-H) Revision 2.17.1 Standard Specification. The TPC-H database was constructed to mimic a real-life point-of-sale system according to the entity-relationship diagram and the data type and scale specifications provided by the TPC-H. The database was populated by scripts that were seeded with random numbers to create the mock data set. The TPC-H specifications have a scale factor by which the record count for each table is derived. For this benchmark, we selected a maximum scale factor of 2. In this case, the TPC-H database contained 3 million records in the ORDERS table and 12 million records in the LINEITEM table.
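Because the table cardinalities used here scale linearly with the scale factor, the record counts quoted throughout this benchmark can be derived from a scale-factor-1 baseline. A small sketch (baseline counts taken from the scale table later in this report):

```python
# Row counts at scale factor 1.0, as reported in the scale table below.
BASE_ROWS_SF1 = {"LINEITEM": 6_000_000, "ORDERS": 1_500_000, "CUSTOMERS": 150_000}

def rows_at_scale(scale_factor):
    """Table row counts scale linearly with the TPC-H scale factor."""
    return {table: int(n * scale_factor) for table, n in BASE_ROWS_SF1.items()}

# The maximum scale factor used in this benchmark.
counts = rows_at_scale(2.0)
```

At scale factor 2.0 this reproduces the 12 million LINEITEM and 3 million ORDERS records cited above.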


Relational Source

Figure 2: TPC-H ER Diagram © 1993-2014 Transaction Processing Performance Council

Log File Sources

The other three data sources to be used in the integration routines for the benchmark were created as ASCII text-based, semi-structured log files. Three log files were created to mimic real-life log files for three cases:

1. Web-click log

2. Coupon log

3. Recommendation engine log

Web-click Log

A web-click log was generated in the same fashion as a standard Apache web server log file. The log file was generated using scripts to simulate two types of entries: 1) completely random page


views (seeded by random numbers) and 2) web-clicks that correspond to actual page views of ordered products (seeded by random records in the TPC-H ORDERS and LINEITEM tables). The "dummy" or "noise" web log entries appeared in a variety of forms, but followed a format consistent with an Apache web-click log entry. All data were randomly selected. The dates were restricted to the year 2015. For example:

249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] "GET /images/footer-basement.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/index.php" "Windows NT 6.0"
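A "noise" entry generator along these lines might look like the following sketch. The value pools (paths, user agents) and field choices are hypothetical stand-ins, since the benchmark's actual generation scripts are not reproduced in this report:

```python
import random

# Hypothetical value pools; the real scripts drew from their own pools.
PATHS = ["/images/footer-basement.png", "/images/side-ad.png", "/index.php"]
AGENTS = ["Windows NT 6.0", "Android 4.1.2", "Windows Phone OS 7.5"]

def noise_entry(rng):
    """Emit one random Apache-style 'noise' line; dates restricted to 2015."""
    ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    ts = (f"{rng.randint(1, 28):02d}/Jan/2015:"
          f"{rng.randint(0, 23):02d}:{rng.randint(0, 59):02d}:{rng.randint(0, 59):02d} -0700")
    return (f'{ip} - anonymous [{ts}] "GET {rng.choice(PATHS)} HTTP/1.0" '
            f'200 {rng.randint(200, 90000)} "http://www.acmecompany.com/index.php" '
            f'"{rng.choice(AGENTS)}"')

line = noise_entry(random.Random(7))
```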

The "signal" web log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The highlighted segments represent those that potentially correspond to actual orders:

154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] "GET /images/side-ad.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/product-search.php?partkey=Q44271" "Android 4.1.2"

The web-click log file contained 26,009,338 entries and was 5.1 GB in size. There was one web-click entry for a random LINEITEM record for each and every ORDERS record (3,000,000 total). Therefore, 12% of the web-click log entries corresponded to orders. The rest of the entries were random.

Coupon Log

A coupon log was generated in the same fashion as a customized Apache web server log file. The coupon log was designed to mimic a special-case log file generated whenever a potential customer viewed an item because of a click-through from a coupon ad campaign. Again, the log file was generated using scripts to simulate two types of entries: 1) completely random page views (seeded by random numbers) and 2) page views that correspond to actual page views of ordered products by actual customers via the coupon ad campaign (seeded by random records in the TPC-H ORDERS, LINEITEM, and CUSTOMERS tables). The "dummy" or "noise" coupon log entry data were randomly selected. The dates were restricted to the year 2015. The "signal" coupon log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The highlighted segments represent those that potentially correspond to actual orders:

49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] "GET /images/header-logo.png HTTP/1.0" 200 75422 "http://www.acmecompany.com/product-view.php?partkey=S22211" "https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" "Windows Phone OS 7.5"
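For the coupon analysis, the fields that matter in an entry like the one above are the part key in the product-view URL and the customer id in the campaign-tracker referrer. A sketch of extracting them with regular expressions (these patterns are ours for illustration; the benchmark's actual parsing expressions are not published in this report):

```python
import re

# Illustrative patterns for the two "signal" fields in a coupon log entry.
PARTKEY = re.compile(r"partkey=(?P<partkey>[A-Z]\d+)")
CUSTOMER = re.compile(r"customerid=(?P<custid>C\d+)")

def coupon_signal(entry):
    """Return (partkey, customerid), with None for any field not present."""
    part, cust = PARTKEY.search(entry), CUSTOMER.search(entry)
    return (part.group("partkey") if part else None,
            cust.group("custid") if cust else None)

entry = ('49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] '
         '"GET /images/header-logo.png HTTP/1.0" 200 75422 '
         '"http://www.acmecompany.com/product-view.php?partkey=S22211" '
         '"https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015'
         '&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" '
         '"Windows Phone OS 7.5"')
```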


The coupon log file contained 20,567,665 entries and was 6.6 GB in size. There was one coupon entry for a random LINEITEM record for each and every ORDERS record (3,000,000 total). Therefore, 15% of the coupon log entries corresponded to orders. The rest of the entries were random.

Recommendation Log

A recommendation engine log was generated as an XML-tagged output log file. The recommendation log was designed to mimic a log entry generated whenever a potential customer viewed an item because of a click-through from a recommended item on the current item page. In this case, the company's website recommended 5 additional or related items for every item viewed. Again, the log file was generated using scripts to simulate two types of entries: 1) "dummy" or completely random page views and random recommendations (seeded by random numbers) and 2) page views that correspond to actual page views of ordered products via the recommendations (seeded by random records in the TPC-H ORDERS and LINEITEM tables). The "dummy" or "noise" recommendation log entry data were randomly selected. The dates were restricted to the year 2015. The "signal" recommendation log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The highlighted segments represent those that potentially correspond to actual orders:

<itemview>
  <cartid>326586002341098</cartid>
  <timestamp>01/01/2015:16:53:07</timestamp>
  <itemid>Q48769</itemid>
  <referreditems>
    <referreditem1>
      <referreditemid1>R33870</referreditemid1>
    </referreditem1>
    <referreditem2>
      <referreditemid2>P19289</referreditemid2>
    </referreditem2>
    <referreditem3>
      <referreditemid3>R28666</referreditemid3>
    </referreditem3>
    <referreditem4>
      <referreditemid4>P72469</referreditemid4>
    </referreditem4>
    <referreditem5>
      <referreditemid5>S23533</referreditemid5>
    </referreditem5>
  </referreditems>
  <sessionid>CGYB-KWEH-YXQO-XXUG</sessionid>
</itemview>

The recommendation log file contained 13,168,336 entries and was 8.3 GB in size. There was one recommendation entry for a random LINEITEM record for each and every ORDERS record (3,000,000 total). Therefore, 23% of the recommendation log entries corresponded to orders. The rest of the entries were random.
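Since each <itemview> element is well-formed XML, the viewed item and its five recommendations can be pulled out with a standard XML parser. An illustrative sketch (element names follow the sample above; the parsing approach is ours, not necessarily the benchmark's):

```python
import xml.etree.ElementTree as ET

# One <itemview> element, matching the sample format shown in this report.
SAMPLE = """<itemview>
  <cartid>326586002341098</cartid>
  <timestamp>01/01/2015:16:53:07</timestamp>
  <itemid>Q48769</itemid>
  <referreditems>
    <referreditem1><referreditemid1>R33870</referreditemid1></referreditem1>
    <referreditem2><referreditemid2>P19289</referreditemid2></referreditem2>
    <referreditem3><referreditemid3>R28666</referreditemid3></referreditem3>
    <referreditem4><referreditemid4>P72469</referreditemid4></referreditem4>
    <referreditem5><referreditemid5>S23533</referreditemid5></referreditem5>
  </referreditems>
  <sessionid>CGYB-KWEH-YXQO-XXUG</sessionid>
</itemview>"""

def parse_itemview(xml_text):
    """Return the viewed item id and the list of five referred item ids."""
    root = ET.fromstring(xml_text)
    item = root.findtext("itemid")
    referred = [root.findtext(f"referreditems/referreditem{i}/referreditemid{i}")
                for i in range(1, 6)]
    return item, referred

item, referred = parse_itemview(SAMPLE)
```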


Scale Factor

Each of the data sources described above was also scaled to different scale factors, so the integration routines (described in the next section) could be executed against data sources of various sizes. The TPC-H database and log files were scaled in the following manner:

Scale Factor                  2.0         1.0         0.5        0.25       0.125
LINEITEM records              12,000,000  6,000,000   3,000,000  1,500,000  750,000
ORDERS records                3,000,000   1,500,000   750,000    375,000    187,500
CUSTOMERS records             300,000     150,000     75,000     37,500     18,750
Log Size Total                20 GB       10 GB       5 GB       2.5 GB     1.25 GB
Web-click Log Size            5.1 GB      2.6 GB      1.3 GB     638 MB     319 MB
Web-click Log Entries         26,009,338  13,004,669  6,502,335  3,251,167  1,625,584
Coupon Log Size               6.6 GB      3.3 GB      1.7 GB     825 MB     413 MB
Coupon Log Entries            20,567,665  10,283,833  5,141,916  2,570,958  1,285,479
Recommendation Log Size       8.3 GB      4.2 GB      2.1 GB     1.0 GB     519 MB
Recommendation Log Entries    13,168,336  6,584,168   3,292,084  1,646,042  823,021

Table 1: TPC-H Database Record Counts and Log Files at Different Scale Factors

Integration Routines

The big data use case of the benchmark was designed to demonstrate real-life scenarios where companies desire to integrate data from their transactional systems with unstructured and semi-structured big data. The benchmark demonstrates this by executing routines that integrate the TPC-H relational source data with the individual log files described above. The following integration routines were created for the benchmark. In both cases, best practices were observed to optimize the performance of each job. For example, there were no data type conversions, dates were treated as strings, and the sizes of data types were minimized.

Informatica Workflow

The following diagram represents the workflow that was created in Informatica Developer.


Figure 3: Overview of Informatica Workflow

Informatica Big Data Edition offers two ways of interacting with Hadoop: via Hive or directly to the Hadoop File System (HDFS). The first part was the extraction of the MySQL data, using the exact same query against MySQL as with the Talend job. This data was stored in Hadoop using Hive. Next, the web, coupon, and recommendation log files were parsed using the same regular expressions as those used with the Talend jobs and loaded directly into HDFS. The third part consisted of the integration of the TPC-H data with the log files using Informatica's Lookup transformation. The resulting delimited output files were also stored directly on HDFS. These routines are described in greater detail in the next section.

Note: At the time of the benchmark, Informatica BDE did not have a MySQL connector included in the package we installed. A generic ODBC connector was used instead. This did not seem to have an overall impact on the performance of the integration job.


Talend Job Design

The following diagram represents the job design that was created in Talend Studio.

Figure 4: Overview of Talend Job Design

Talend Studio 6 offers a new and powerful capability of interacting with Spark on Hadoop, in addition to the conventional approaches via Hive or directly to the Hadoop File System (HDFS). The first part was the extraction of the MySQL data using the following query against MySQL:

SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
FROM LINEITEM
LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY;

This data was stored in Spark. Second, the web, coupon, and recommendation log files were parsed using regular expressions and loaded directly into Spark. The third part consisted of the integration of the TPC-H data with the log files using Talend's tMap transformation. The resulting delimited files were stored directly on HDFS.

Informatica and Spark

We considered and even attempted running the benchmark using Informatica BDE on Spark via Hive. We were able to configure Hive to leverage Spark for execution using the Cloudera CDH administration tools. However, there were significant inconsistencies in execution, making this usage unstable. Cloudera provides the following disclaimer regarding the use of Hive on Spark:

Important: Hive on Spark is included in CDH 5.4 but is not currently supported or recommended for production use. If you are interested in this feature, try it out in a test environment until we address the issues and limitations needed for production-readiness.


Therefore, this method was deemed unstable and an unsuitable use case at the time of the benchmark.

Benchmark Results

Use Case 1: E-Influence

The goal of the first use case for the benchmark was to prepare a data set that correlates products ordered with the page views on the e-commerce website. The integration jobs were written to map the page views to products ordered. The following diagram is a conceptual mapping of the integration. In Informatica, the Lookup transformation was used to link the web-click log with records in the TPC-H database using the field PARTKEY. In Talend, the tMap transformation was used.

Figure 5: E-Influence Mapping

Execution Times

The following table lists the execution times of the E-Influence (web-click) mapping in seconds.

Web-click (Use Case 1)    SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Informatica - MapReduce   5,221   1,977   622     308      111
Talend - Spark            579     383     257     148      88

Table 2: E-Influence Mapping Execution Times
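Conceptually, both the Lookup transformation and tMap perform the same operation in this use case: cache the TPC-H extract, then probe it with each parsed log entry on the (part key, date) pair. A minimal in-memory sketch with made-up rows (the real jobs execute this on the cluster via MapReduce and Spark, respectively):

```python
def correlate(orders, pageviews):
    """Return the page views that match an ordered (partkey, date) pair."""
    # Build the lookup cache once (analogous to the Lookup transformation's
    # cached reference data), then probe it with each page view.
    cache = set(orders)
    return [view for view in pageviews if view in cache]

# Hypothetical rows for illustration: (partkey, date) pairs.
orders = [("Q44271", "02/Jan/2015"), ("S22211", "01/Jan/2015")]
views = [("Q44271", "02/Jan/2015"), ("Z99999", "03/Jan/2015")]
hits = correlate(orders, views)
```

The performance difference measured below comes not from this logic, which is identical in spirit, but from where and how the engines execute it.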


Use Case 2: Coupon Influence

The objective of the second use case for the benchmark was to prepare a data set that correlates products ordered with a coupon advertisement campaign. The integration jobs were written to map the coupon-ad-related page views and customers to products ordered. The following diagram is a conceptual mapping of the integration. In Informatica, the Lookup transformation was used to link the coupon log with records in the TPC-H database using the fields PARTKEY and CUSTKEY. In Talend, the tMap transformation was used.

Figure 6: Coupon Influence Mapping

Execution Times

The following table lists the execution times of the Coupon Influence mapping in seconds.

Coupon (Use Case 2)       SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Informatica - MapReduce   8,573   3,197   1,210   693      396
Talend - Spark            882     478     331     202      169

Table 3: Coupon Mapping Execution Times


Use Case 3: Recommendation Influence

The intent of the third use case for the benchmark was to prepare a data set that correlates products ordered with the influence of a recommendation engine. The integration jobs were written to map the recommended products to products ordered. The following diagram is a conceptual mapping of the integration. In Informatica, the Lookup transformation was used to link the recommendation log with records in the TPC-H database using the field PARTKEY five times. In Talend, the tMap transformation was used.

Figure 7: Recommendation Influence Mapping

Execution Times

The following table lists the execution times of the Recommendation Influence mapping in seconds.

Recommendation (Use Case 3)  SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Informatica - MapReduce      14,973  6,544   2,802   1,272    590
Talend - Spark               1,313   701     473     236      188

Table 4: Recommendation Mapping Execution Times


Results Summary

The following tables are the complete result set. All execution times are displayed in seconds (unless otherwise noted).

Scale Factor Summary      SF 2.0      SF 1.0     SF 0.5     SF 0.25    SF 0.125
TPC-H LINEITEM Records    12,000,000  6,000,000  3,000,000  1,500,000  750,000
Log File Size (Total)     20 GB       10 GB      5 GB       2.5 GB     1.25 GB

Informatica - MapReduce Execution Times  SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Load TPC-H (MySQL) into Hadoop           859     711     349     164      95
Load Web Log into Hadoop                 388     360     76      43       22
Load Coupon Log into Hadoop              504     404     111     58       32
Load Recommendation Log into Hadoop      578     442     129     65       33
Web-Click (Use Case 1)                   5,221   1,977   622     308      111
Coupon (Use Case 2)                      8,573   3,197   1,210   693      396
Recommendation (Use Case 3)              14,973  6,544   2,802   1,272    590
TOTAL                                    31,096  13,635  5,299   2,603    1,279
TOTAL (Minutes)                          518.3   227.3   88.3    43.4     21.3
TOTAL (Hours)                            8.6     3.8     1.5     0.7      0.4

Talend - Spark Execution Times           SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Load TPC-H (MySQL) into Spark            748     642     302     143      82
Load Web Log into Spark                  158     79      33      21       12
Load Coupon Log into Spark               213     123     59      33       17
Load Recommendation Log into Spark       204     117     55      31       16
Web-Click (Use Case 1)                   579     383     257     148      88
Coupon (Use Case 2)                      882     478     331     202      169
Recommendation (Use Case 3)              1,313   701     473     236      188
TOTAL                                    4,097   2,523   1,510   814      572
TOTAL (Minutes)                          68.3    42.1    25.2    13.6     9.5
TOTAL (Hours)                            1.1     0.7     0.4     0.2      0.2

Table 5: Execution Summary


The following chart is a comparison of the total execution times of both the Informatica workflow and Talend jobs to complete all loading, transformations, and file generation. Regression lines were added.

              SF 0.125  SF 0.25  SF 0.5  SF 1.0  SF 2.0
Times Faster  2.2x      3.2x     3.5x    5.4x    7.6x

Figure 8: Comparison of Total Execution Times
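The "times faster" row follows directly from the totals in Table 5; dividing the Informatica totals by the Talend totals reproduces it:

```python
# Total execution times in seconds, from Table 5.
informatica = {"SF0.125": 1279, "SF0.25": 2603, "SF0.5": 5299,
               "SF1.0": 13635, "SF2.0": 31096}
talend = {"SF0.125": 572, "SF0.25": 814, "SF0.5": 1510,
          "SF1.0": 2523, "SF2.0": 4097}

# Ratio of totals, rounded to one decimal place as reported in Figure 8.
speedup = {sf: round(informatica[sf] / talend[sf], 1) for sf in informatica}
```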

Conclusion

Leveraging Spark directly gives a clear advantage in the ability to process increasingly larger data sets. When the size of the data sets was relatively small (i.e., fewer than 3 million transactional records and 5 GB of log files), the difference in execution time was negligible, because both tools were able to complete the jobs in less than an hour. However, once the scale factor of 0.5 was reached, the difference became impactful. At a scale factor of 2.0, the Informatica job took over 8 hours to complete. Increasing execution times will be frustrating for analysts and data scientists who are eager to conduct their data experiments and advanced analytics, but are left to wait a day for their target data sets to be integrated. The addition of complexities in the mappings (multiple lookup keys)


seemed to further exacerbate the concern. When using MapReduce, the Lookup transformation digs a large performance hole for itself as the lookup data cache and index cache get overwhelmed. By leveraging the in-memory capabilities of Spark, users can integrate data sets at much faster rates. Spark uses fast Remote Procedure Calls for efficient task dispatching and scheduling. It also leverages a thread pool for execution of tasks, rather than a pool of Java Virtual Machine processes. This enables Spark to schedule and execute tasks at a rate measured in milliseconds, whereas MapReduce scheduling takes seconds, and sometimes minutes, in busy clusters. Spark through Hive would not be an approach for achieving the Spark results shown here; execution results ranged from inconsistent to indefinite. For a vendor, developing a good use of Spark is not dissimilar to developing the product initially: it takes years and millions of dollars. Utilizing the full capability of Spark for data integration with Hadoop is a winning approach.