Processing Big Data with Shell Scripts, Part 1: A Summary of Methods


weixin_33873846


Tags: shell, big data, awk

Original link: http://www.cnblogs.com/yaohaitao/p/5779091.html


Reposted from: http://longriver.me/?p=57

Method 1:

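A single process is slow on large files (on the order of millions of lines or more). One remedy is to divide and conquer with awk's modulo operator: each process handles only the lines whose line number leaves a given remainder, so the work spreads across multiple CPU cores.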

 
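    for ((i=0; i<5; i++)); do
        # worker i handles the lines whose line number mod 5 equals i
        awk -v n="$i" 'NR % 5 == n' query_ctx.20k |
            wc -l 1> "output_$i" 2> "err_$i" &
    done
    wait    # block until all five background workers have finished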
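To make the sharding above concrete, here is a minimal self-contained sketch that can be run as is. The sample file, the temporary directory, and the `workers` variable are assumptions added for illustration and are not part of the original script. It generates 1003 lines, counts them in five parallel shards, and then sums the partial counts.

```shell
#!/usr/bin/env bash
# Sketch: shard a file across N background awk workers by line number.
work=$(mktemp -d)
infile=$work/sample.txt      # hypothetical sample input
seq 1 1003 > "$infile"       # 1003 lines of made-up data
workers=5

for ((i = 0; i < workers; i++)); do
    # Worker i sees only the lines where NR % workers == i.
    awk -v n="$i" -v w="$workers" 'NR % w == n' "$infile" |
        wc -l > "$work/output_$i" 2> "$work/err_$i" &
done
wait    # all workers must finish before the partial results are read

# Combine the per-shard counts into one total.
total=$(cat "$work"/output_* | awk '{ s += $1 } END { print s }')
echo "total lines counted: $total"
```

Each worker still reads the whole file and discards four fifths of it, so this layout pays off mainly when the per-line work after awk is much more expensive than the read itself.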

Method 2:

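Alternatively, split the large file up front with split or by hashing a key. The drawback is the preprocessing step: partitioning the large file runs in a single process and is itself fairly time-consuming.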

 
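    infile=$1            # input file to partition
    opdir=querys         # output directory for the bucket files
    opfile=res           # bucket file name prefix
    s=`date "+%s"`       # start time, in seconds since the epoch

    while IFS= read -r line
    do
        # awk_c and tools/default are the author's own helpers; they
        # presumably extract the IMEI key and map it to one of 1000 buckets
        imei=`./awk_c "$line"`
        no=`./tools/default "$imei" 1000`
        echo "$line" >> "$opdir/$opfile-$no"
    done < "$infile"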
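The loop above starts two helper processes for every input line, which is a large part of why the pre-split is slow. As a rough sketch of the two partitioning styles the text mentions, the block below uses `split -l` for fixed-size chunks and a single awk pass for key-based buckets. The sample data, the assumption that the key is a numeric first field, and the bucket count of 4 are all made up for illustration; they stand in for the author's `awk_c` and `tools/default` helpers, whose behavior is not shown in the source.

```shell
#!/usr/bin/env bash
# Sketch: two ways to pre-split a large file before parallel processing.
work=$(mktemp -d)
infile=$work/sample.txt          # hypothetical input: "<numeric key> <payload>"
opdir=$work/querys
mkdir -p "$opdir" "$work/chunks"

# Made-up records: 100 lines, keys cycle through 1..10.
for ((i = 1; i <= 100; i++)); do
    echo "$(( (i - 1) % 10 + 1 )) payload_$i"
done > "$infile"

# (a) split: fixed-size chunks of 30 lines, named chunk_aa, chunk_ab, ...
split -l 30 "$infile" "$work/chunks/chunk_"

# (b) key bucketing in one awk pass: every line with the same key lands
#     in the same bucket file, res-<key mod buckets>.
awk -v dir="$opdir" -v buckets=4 '{
    out = dir "/res-" ($1 % buckets)
    print >> out
}' "$infile"

ls "$work/chunks" "$opdir"
```

Chunking with split is the simpler option when any line can go to any worker; bucketing by key matters when all records for one key must be processed together. With a large number of buckets, some awk implementations may run into the open-file limit, so the bucket count is worth keeping modest.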

Method 3:

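This method extends Method 2. Once the file has been pre-split, a shell script can start several processes to work on the pieces in parallel. To keep concurrent processes from producing jumbled output, you can use a lock, or you can coordinate through file naming. The example below makes clever use of mv: the rename acts as a mutex, which makes adding processes flexible. As long as the machine has resources to spare, more processes can be started at any time without corrupting the output.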

 
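    output=hier_res      # results directory
    input=dbscan_res     # directory holding the pre-split res-* files
    prefix1=tmp-         # marks a file that is being processed
    prefix2=res-         # marks a finished result

    for file in "$input"/res*
    do
        tmp=${file#*-}
        ofile1=${prefix1}${tmp}
        ofile2=${prefix2}${tmp}
        # skip files that another process has claimed or already finished
        if [ ! -f "$output/$ofile1" ] && [ ! -f "$output/$ofile2" ]
        then
            # claim the file: create a placeholder and rename it to tmp-*
            touch "$output/aaa_$tmp"
            mv "$output/aaa_$tmp" "$output/$ofile1"
            if [ $? -eq 0 ]
            then
                echo "dealing $file"
                python hcluster.py < "$file" 1> "$output/$ofile1" 2> hier.err
                mv "$output/$ofile1" "$output/$ofile2"
            fi
        fi
    done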
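In the script above the existence test and the mv are separate steps, and mv replaces an existing target, so two processes that reach the same file within a narrow window could both proceed. A stricter variant, sketched here as an assumption rather than as the author's method, claims each file with `mkdir`, which either creates the directory or fails. The `claim-` directories, the `worker` function, the sample inputs, and the use of `wc -l` as a stand-in for `python hcluster.py` are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: claim each input file with an atomic mkdir before processing it.
work=$(mktemp -d)
input=$work/dbscan_res
output=$work/hier_res
mkdir -p "$input" "$output"

# Six small sample inputs named res-1 .. res-6 (made-up data).
for n in 1 2 3 4 5 6; do
    seq 1 "$n" > "$input/res-$n"
done

worker() {
    local id=$1 file base tmp
    for file in "$input"/res-*; do
        base=${file##*/}        # strip the directory part
        tmp=${base#res-}        # keep only the numeric suffix
        # Only one process can create the claim directory; the others skip.
        if mkdir "$output/claim-$tmp" 2>/dev/null; then
            echo "$tmp" >> "$output/worker-$id.log"
            # Stand-in for "python hcluster.py": count the input lines.
            wc -l < "$file" > "$output/tmp-$tmp"
            mv "$output/tmp-$tmp" "$output/res-$tmp"
        fi
    done
}

# Three workers race over the same six files.
worker 1 & worker 2 & worker 3 &
wait

processed=$(cat "$output"/worker-*.log | sort -n | xargs)
echo "processed: $processed"
```

Every file number should appear exactly once in the combined worker logs, regardless of how many workers are started. One trade-off of this sketch is that a worker that crashes mid-file leaves its claim directory behind, so a rerun would need to clear stale claims first.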

Reposted from: https://www.cnblogs.com/yaohaitao/p/5779091.html
