Log Parsing (1): Traversing Large Files

Latest recommended article published on 2023-10-27 10:19:55

weixin_30840573

Original link: http://www.cnblogs.com/hudiefeifei/p/4316432.html

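Over the past few days I took on a project that processes large volumes of log data: extract the URLs from the logs, send an HTTP request to each one, and check whether the page at that URL contains traffic-statistics code.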

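This part traverses the log files, extracts the URLs, and writes them to new files. Same-named log files from three days are merged into a single output file.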
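The two small steps the script relies on can be sketched in isolation: turning a log file name into a host, and pulling the request path out of one log line. This is only a sketch. The helper names `host_from_log_name` and `extract_path` are hypothetical, and the sample line is an invented entry that assumes an access-log layout in which the path is the 7th whitespace-separated field, which is what the script below reads.

```ruby
# Hypothetical helper: home_so_com-jiancai.log -> home.so.com
def host_from_log_name(name)
  base = name.include?("-") ? name.split("-")[0] : name.split(".log")[0]
  base.gsub("_", ".")
end

# Hypothetical helper: returns the 7th whitespace-separated field of a log
# line, or nil when the line is blank, too short, or points at the site root.
def extract_path(line)
  path = line.strip.split(" ")[6]
  return nil if path.nil? || path == "/" || path == "//"
  path
end

# Invented sample line (assumed layout, not taken from the real logs)
sample = '127.0.0.1 - - [27/Feb/2015:10:00:00 +0800] "GET /list/index.html HTTP/1.1" 200 512'

puts host_from_log_name("home_so_com-jiancai.log") + extract_path(sample)
```

Returning nil for unusable lines lets the caller skip them with a single check instead of wrapping the append in begin/rescue.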

# Recursively traverse a directory and process every regular file in it
def traverse_dir(file_path)
  if File.directory? file_path
    Dir.foreach(file_path) do |file|
      if file != "." and file != ".."
        traverse_dir(file_path + "/" + file)
      end
    end
  else
    puts File.basename(file_path)

    if File.basename(file_path).include?("-")
      # Derive the host from the file name: home_so_com-jiancai.log becomes home.so.com
      url1 = File.basename(file_path).split("-")[0].gsub("_", ".")
    else
      url1 = File.basename(file_path).split(".log")[0].gsub("_", ".")
    end

    # Strip the extension: home_so_com-jiancai.log becomes home_so_com-jiancai
    file_name = File.basename(file_path).split(".")[0]
    #puts file_name
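    # Append mode, so same-named logs from different days land in one output file
    pFile = File.open(file_name + ".txt", "a")
    arr = Array.new
    # File.foreach reads one line at a time instead of loading the whole log into memory
    File.foreach(file_path) do |oneurl|
      next if oneurl.strip.empty?
      # The URL path is the 7th whitespace-separated field of the log line
      path = oneurl.split(" ")[6]
      # Skip lines without a 7th field and bare root paths
      next if path.nil? || path == "/" || path == "//"
      #arr<<path.match(/(?:.)+(?:[^\/])/)[0]
      arr << path
    end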
    arr.uniq!
    # Keywords that mark URLs to exclude from the output
    skip_words = ["Interface", "Ajax", "ajax", "ashx", "InterFace", "interface",
                  "news.aspx", "portal.aspx", "homepage0.aspx", "homepage1.aspx",
                  "homepage2.aspx", "homepage3.aspx", "homepage4.aspx", "homepage5.aspx"]
    arr.each do |url6|
      next if url6.nil? || url6.strip.empty?
      # Filter URLs by keyword
      unless skip_words.any? { |word| url6.include?(word) }
        pFile.puts url1 + url6
      end
    end
    pFile.close
  end
end
traverse_dir('E:\home\test')
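
One thing to note about the script above: `arr.uniq!` removes duplicates only within a single input file, and the output file is opened in append mode, so a path that appears in the logs of more than one day is written more than once. A possible alternative, sketched below, deduplicates across all input files with a `Set` while still reading line by line. The helper name `merge_unique_paths` is hypothetical, and the sketch leaves out the keyword filter and the host prefix to stay short.

```ruby
require "set"

# Hypothetical helper: collects the unique 7th-field paths from several log
# files and writes them to out_path in first-seen order.
# Returns the number of unique paths written.
def merge_unique_paths(log_files, out_path)
  seen = Set.new
  File.open(out_path, "w") do |out|
    log_files.each do |log|
      # Stream the file so large logs are never held in memory as a whole
      File.foreach(log) do |line|
        path = line.split(" ")[6]
        next if path.nil? || path == "/" || path == "//"
        # Set#add? returns nil when the element is already in the set
        out.puts path if seen.add?(path)
      end
    end
  end
  seen.size
end
```

Called with the three daily copies of one log (for example, three hypothetical paths to `home_so_com-jiancai.log` in per-day folders), it would produce a single output file without repeated paths. Memory use then grows with the number of distinct paths, not with the size of the logs.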