Optimized Processing of Massive Small Files in a Hadoop Cluster
Abstract

With the development of the Mobile Internet and the Internet of Things, the amount of data on the Internet is growing exponentially, and traditional technical architectures have become inadequate for processing massive amounts of data. Hadoop, a framework that can process massive data efficiently, has received increasing attention from industry. Hadoop uses a master-slave architecture and consists of the HDFS file system and the MapReduce computing framework. The single-NameNode design of HDFS simplifies file-system management, but it also leads to low efficiency when processing small files. Based on a study of how industry and academia process massive numbers of small files, and on the technical details of Hadoop and its ecosystem, this thesis observes that current solutions neither take the diversity and repeatability of file types into consideration nor thoroughly solve the single-point-of-failure problem of a Hadoop cluster. This thesis therefore puts forward a plan to optimize the Hadoop cluster, using related Hadoop components to improve the performance of processing massive numbers of small files.

In this thesis, the MD5 algorithm is used to determine whether two files are duplicates by comparing their content. In this way, the number of written files, and hence disk consumption, is reduced. MapFile is used to merge small files, and files are stored differently according to their size: a small file is placed into a multi-level merge queue according to its file type, and when the queue threshold is reached, the small files in the queue are merged and written into HDFS. In this way, the number of files is reduced to a certain degree. HBase is used to persist index information. This not only ensures an effective level of data reading and writing, but also provides stable external service by using a cache to store the index and by providing consistency protection between the data in the cache and the indexer. This thesis also presents a "mark-delete-compress" method for deleting files: when a deletion request is received, the flag of the small file is modified in the cache; after the small file is deleted, the cluster compresses the large file in which the small file is located. By this method, on the one hand, the deletion r
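The deduplication and merging steps summarized above can be illustrated with a minimal, Hadoop-free sketch. This is not the thesis's actual implementation: the class and parameter names are assumptions, the flush callback stands in for the MapFile merge-and-write into HDFS, and queueing is keyed by file extension as a stand-in for the file-type classification the abstract describes.

```python
import hashlib
from collections import defaultdict

class SmallFileIngestor:
    """Illustrative sketch of the abstract's pipeline:
    (1) MD5-based deduplication of file content, and
    (2) per-type merge queues that are flushed (merged into one
        large write) once a configurable size threshold is reached.
    All names here are hypothetical, not the thesis's API."""

    def __init__(self, threshold_bytes, flush):
        self.threshold = threshold_bytes
        self.flush = flush                  # called with (file_type, [(name, data), ...])
        self.seen = set()                   # MD5 digests of content already stored
        self.queues = defaultdict(list)     # file_type -> pending small files
        self.sizes = defaultdict(int)       # file_type -> bytes currently queued

    def put(self, name, data):
        # MD5 deduplication: identical content is written only once.
        digest = hashlib.md5(data).hexdigest()
        if digest in self.seen:
            return False                    # duplicate, skip the disk write
        self.seen.add(digest)

        # Queue by file type (extension used as a simple proxy).
        ftype = name.rsplit(".", 1)[-1]
        self.queues[ftype].append((name, data))
        self.sizes[ftype] += len(data)

        # Threshold reached: merge the queued small files into one
        # large write (a MapFile in the thesis) and reset the queue.
        if self.sizes[ftype] >= self.threshold:
            self.flush(ftype, self.queues.pop(ftype))
            self.sizes.pop(ftype)
        return True
```

A short usage example: with a 10-byte threshold, two writes of identical content trigger one store, and the queue for the `txt` type is merged once its accumulated size crosses the threshold.

```python
merged = []
ing = SmallFileIngestor(10, lambda t, files: merged.append((t, [n for n, _ in files])))
ing.put("a.txt", b"hello")   # stored, queued (5 bytes)
ing.put("b.txt", b"hello")   # duplicate content, skipped
ing.put("c.txt", b"world!")  # queued; 11 bytes >= 10, queue is flushed
```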