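Analyzing Users' News Browsing Behavior

For a project at my company, I was fortunate to obtain one day of user browsing records provided by Sohu. After preprocessing, the data looks like this: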

10000082	12002235,12002254,12002273,12002231,12002229
10000169	12002684
10000170	11990159,11964438,11967826,11962239,11993674,11994700,11988874,11980097,11869151,11989021,11988798
10000197	12003720,12005653,11995420,12005687,12002684,12004640,11922834,11996252,11993003,11992541
10000207	12003312,12003720,11993003,11988798,11989642,11981421,11987264,11992722,11991992,11990421,12003198,11993438,12002683,12003945,12003310,12002358,12007144,11823605,11822852
10000396	11989070
10000406	11991992,11993001,11986248,11993056,12002633,12002684,12003198,12003945
10000472	11984764,11935680,11958279,11885528,11945566
10000497	11988838,11979647,11963203,11975444,11976523,11976446,11974896,11978179,11977404,11994038,11989505,12001199,11992727,11989070,11989719,11989544,11990159,12003107,11963210,12004131,12006906,11980456,11989247,11989089,11988494,11978575,11977213,11976315,11964623,11964802,11988656
10000502	11988798,11988874,11981396

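The first column is the user ID. After a tab come the IDs of the news articles that user viewed that day, separated by commas.

With this data, the simplest idea is to look at the distribution of how many articles each user reads.

First, we sort the users by the number of articles they read, from highest to lowest: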
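To make the record format concrete, here is a minimal sketch of how a single line could be taken apart. The class `RecordParser` and its two helper methods are illustrative names only and do not appear in the original programs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RecordParser {

    // Hypothetical helper: the text before the first tab is the user ID.
    static String userId(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    // Hypothetical helper: the comma-separated news IDs after the tab.
    // Returns an empty list when the line has no tab or nothing after it.
    static List<String> newsIds(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0 || tab == line.length() - 1) {
            return Collections.emptyList();
        }
        return Arrays.asList(line.substring(tab + 1).split(","));
    }

    public static void main(String[] args) {
        // last record of the sample shown above
        String line = "10000502\t11988798,11988874,11981396";
        System.out.println(userId(line) + " read " + newsIds(line).size() + " articles");
    }
}
```

For the sample record this prints `10000502 read 3 articles`.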

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


public class GetTopUser {
	
	public static void main(String[] args) throws IOException{
		
		// user ID -> number of news IDs listed for that user
		Map<String, Integer> usermap = new HashMap<String, Integer>();
		
		// try-with-resources closes both files, even if an exception is thrown
		// Note: the next program expects fuser.csv under F:\felven, so move it there or adjust the path
		try (BufferedReader br = new BufferedReader(new FileReader("F:\\felven\\user_newsid.txt"));
		     BufferedWriter bw = new BufferedWriter(new FileWriter("F:\\fuser.csv"))) {
			
			String line;
			while((line=br.readLine())!=null){
				// each line is: userID <TAB> newsID,newsID,...
				String[] outline=line.split("\t",-1);
				if(outline.length<2){
					continue; // skip malformed lines without a tab
				}
				String[] newsid=outline[1].split(",",-1);
				usermap.put(outline[0], newsid.length);
			}
			
			// write "userID,count" rows, largest count first
			for(Map.Entry<String, Integer> entry : getSortedHashtableByValue(usermap)){
				bw.write(entry.getKey()+","+entry.getValue());
				bw.newLine();
			}
		}
	}
	
	// Returns the entries of the map sorted by value in descending order.
	public static List<Map.Entry<String, Integer>> getSortedHashtableByValue(Map<String, Integer> h) {
		List<Map.Entry<String, Integer>> entries = new ArrayList<Map.Entry<String, Integer>>(h.entrySet());
		entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
		return entries;
	}

}

The beginning of the output looks like this:

5674681035476963338,4664
5674760244232720401,3328
20600187,2457
5674691824099266582,2199
27921511,1439
5687751514672599060,1144
5693365952146575377,901
5688933230787432458,897
5699392045035032598,869
51317069,857
5686681246776692770,853
5687367561260306461,812
53079175,774
5692903716043100164,773
5650687,770
5680990719402053652,767
18011207,757
46119937,723
5676210858208792603,632
16565521,623
5696793758671048720,567
51316830,556
25325106,555
17353008,554
5681349731125563425,554
21697370,552
4121,538
5691749455401848838,532
5687794320136998943,527
5677917920676548610,519
5692398325223919624,515

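The first column is the user ID and the second is the number of articles that user viewed that day. Now we can compute the distribution: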

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;


public class Interval {
	
	public static void main(String[] args) throws IOException{
		// exclusive lower bounds of the buckets, checked from the highest down
		int[] bounds={1000,900,800,700,600,500,400,300,200,100};
		// counts[i] belongs to bounds[i]; the extra last slot holds counts of 100 or fewer
		int[] counts=new int[bounds.length+1];
		int sum=0;
		
		// try-with-resources closes the reader, which the first version never did
		try (BufferedReader br = new BufferedReader(new FileReader("F:\\felven\\fuser.csv"))) {
			String line;
			while((line=br.readLine())!=null){
				// each row is "userID,count"; parse the count only once
				int n=Integer.parseInt(line.split(",")[1]);
				sum++;
				int i=0;
				while(i<bounds.length && n<=bounds[i]){
					i++;
				}
				counts[i]++;
			}
		}
		
		System.out.println("total user is "+sum);
		for(int i=0;i<bounds.length;i++){
			System.out.println(">"+bounds[i]+" is "+counts[i]+" and percent is "+(double)counts[i]/sum);
		}
		// despite the "<100" label, this bucket also contains users with exactly 100 articles
		System.out.println("<100 is "+counts[bounds.length]+" and percent is "+(double)counts[bounds.length]/sum);
	}
}

The output is:

total user is 2334825
>1000 is 6 and percent is 2.5697857441135845E-6
>900 is 1 and percent is 4.2829762401893076E-7
>800 is 5 and percent is 2.1414881200946537E-6
>700 is 6 and percent is 2.5697857441135845E-6
>600 is 2 and percent is 8.565952480378615E-7
>500 is 15 and percent is 6.424464360283961E-6
>400 is 31 and percent is 1.3277226344586854E-5
>300 is 115 and percent is 4.9254226762177035E-5
>200 is 231 and percent is 9.893675114837301E-5
>100 is 1560 and percent is 6.68144293469532E-4
<100 is 2332853 and percent is 0.9991553970854347


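There are about 2.33 million users in total, and 99.9% of them viewed 100 articles or fewer (the last bucket includes exactly 100). As for the users above 1,000, such as the top one with 4,664 articles in a single day, it is reasonable to treat them as crawlers.

Next, we break down the users with 100 articles or fewer. The input file fsmall.csv is presumably the subset of fuser.csv containing just these users, since its total matches the last bucket above: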
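Dropping suspected crawlers, or restricting the data to a range of counts as the later steps do, comes down to filtering the `userID,count` rows by their count. The original post does not show how this was done, so the following is only a minimal sketch; `CountFilter` and `keepAtMost` are hypothetical names, and the cutoff values are examples rather than thresholds taken from the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CountFilter {

    // Keeps the "userID,count" rows whose count is at most maxCount.
    static List<String> keepAtMost(List<String> rows, int maxCount) {
        List<String> kept = new ArrayList<String>();
        for (String row : rows) {
            int count = Integer.parseInt(row.substring(row.indexOf(',') + 1).trim());
            if (count <= maxCount) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // three rows from the sorted output above, plus user 10000502 from the sample input
        List<String> rows = Arrays.asList(
                "5674681035476963338,4664", "20600187,2457", "4121,538", "10000502,3");
        System.out.println(keepAtMost(rows, 1000)); // drops the two suspected crawlers
        System.out.println(keepAtMost(rows, 100));  // keeps only users with 100 or fewer
    }
}
```

Writing the kept rows to a new file would give a subset that the same bucketing program can process again.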

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;


public class Interval {
	
	public static void main(String[] args) throws IOException{
		// exclusive lower bounds of the buckets, checked from the highest down
		int[] bounds={90,80,70,60,50,40,30,20,10};
		// counts[i] belongs to bounds[i]; the extra last slot holds counts of 10 or fewer
		int[] counts=new int[bounds.length+1];
		int exactly100=0;
		int sum=0;
		
		try (BufferedReader br = new BufferedReader(new FileReader("F:\\felven\\fsmall.csv"))) {
			String line;
			while((line=br.readLine())!=null){
				// each row is "userID,count"; parse the count only once
				int n=Integer.parseInt(line.split(",")[1]);
				sum++;
				if(n==100){
					exactly100++;
				}
				else{
					// assumes the file holds no count above 100; such a row would land in the >90 bucket
					int i=0;
					while(i<bounds.length && n<=bounds[i]){
						i++;
					}
					counts[i]++;
				}
			}
		}
		
		System.out.println("total user is "+sum);
		System.out.println("=100 is "+exactly100+" and percent is "+(double)exactly100/sum);
		for(int i=0;i<bounds.length;i++){
			System.out.println(">"+bounds[i]+" is "+counts[i]+" and percent is "+(double)counts[i]/sum);
		}
		System.out.println("<=10 is "+counts[bounds.length]+" and percent is "+(double)counts[bounds.length]/sum);
	}
}

The output is:

total user is 2332853
=100 is 54 and percent is 2.314762224623669E-5
>90 is 540 and percent is 2.314762224623669E-4
>80 is 915 and percent is 3.9222359917234393E-4
>70 is 1696 and percent is 7.270068024003227E-4
>60 is 2954 and percent is 0.0012662606688033922
>50 is 5822 and percent is 0.0024956566058812963
>40 is 13397 and percent is 0.005742753615422832
>30 is 33956 and percent is 0.014555567796170612
>20 is 101064 and percent is 0.04332206101284564
>10 is 356228 and percent is 0.15270057736171117
<=10 is 1816227 and percent is 0.7785432686928838


Here we can see that about 77.9% of these users read 10 or fewer articles that day, so we can break this group down further:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;


public class Interval {
	
	public static void main(String[] args) throws IOException{
		// counts[k] = users who read exactly k articles (k = 1..10);
		// counts[0] collects every other value and is reported as "=0"
		int[] counts=new int[11];
		int sum=0;
		
		try (BufferedReader br = new BufferedReader(new FileReader("F:\\felven\\fssmall.csv"))) {
			String line;
			while((line=br.readLine())!=null){
				// each row is "userID,count"; parse the count only once
				int n=Integer.parseInt(line.split(",")[1]);
				sum++;
				if(n>=1 && n<=10){
					counts[n]++;
				}
				else{
					counts[0]++;
				}
			}
		}
		
		System.out.println("total user is "+sum);
		for(int k=10;k>=0;k--){
			System.out.println("="+k+" is "+counts[k]+" and percent is "+(double)counts[k]/sum);
		}
	}
}

The result is:

total user is 1816227
=10 is 70543 and percent is 0.03884040926602236
=9 is 82179 and percent is 0.04524709741678766
=8 is 95689 and percent is 0.052685594917375414
=7 is 111435 and percent is 0.061355216060547495
=6 is 131828 and percent is 0.07258343808345542
=5 is 157174 and percent is 0.08653874212859956
=4 is 191520 and percent is 0.1054493738943425
=3 is 236860 and percent is 0.13041321376678136
=2 is 299791 and percent is 0.16506251696511504
=1 is 439208 and percent is 0.24182439750097318
=0 is 0 and percent is 0.0


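In the end, we find that most users read between 1 and 4 articles: about 64% of the users with 10 or fewer articles, which is roughly half of all 2.33 million users. Reality is probably not far from this.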
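The share of users reading 1 to 4 articles can be recomputed directly from the counts printed above. This is a small worked example rather than part of the original analysis; `ShareOfLightReaders` and `share` are illustrative names.

```java
public class ShareOfLightReaders {

    // Sums counts[from..to] (inclusive indices) and divides by the given total.
    static double share(long[] counts, int from, int to, long total) {
        long part = 0;
        for (int i = from; i <= to; i++) {
            part += counts[i];
        }
        return (double) part / total;
    }

    public static void main(String[] args) {
        // users who read exactly 1, 2, ..., 10 articles, copied from the output above
        long[] byCount = {439208, 299791, 236860, 191520, 157174,
                          131828, 111435, 95689, 82179, 70543};
        long upTo10 = 1816227;   // users with 10 or fewer articles
        long allUsers = 2334825; // all users in the one-day data set

        System.out.println("1-4 articles, share of users with <=10: " + share(byCount, 0, 3, upTo10));
        System.out.println("1-4 articles, share of all users: " + share(byCount, 0, 3, allUsers));
    }
}
```

Indices 0 to 3 of the array correspond to 1 to 4 articles, so the same helper can be reused for any other range of the table.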



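Update (11.20)

After joining Sohu, I got access to more data, including user browsing records from the Sohu News mobile client. Here I use the data from November 2 to November 14.

First, we count how many users used the Sohu News client on each day. The first column is the date (MMDD) and the second is the number of users: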

1114 6539125
1113 6395081
1112 6668126
1111 6142667
1110 5650577
1109 6259603
1108 6034332
1107 6399206
1106 6372263
1105 6288124
1104 6279249
1103 6238395
1102 5893482

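Overall, the Sohu News client has roughly 6 million daily active users (between 5.65 and 6.67 million on these 13 days), which is quite impressive.

Next we analyze a single day, using November 14 as the example. We compute the total number of clicks (that is, the number of times a news article was opened), then split the users into buckets by the number of articles they read. For each bucket, the output shows the number of users and the share of all clicks that those users account for.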
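The "roughly 6 million" figure can be checked by averaging the 13 daily values. A minimal sketch, with `DailyActiveUsers` and `mean` as illustrative names:

```java
public class DailyActiveUsers {

    // Arithmetic mean of the given values.
    static double mean(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }

    public static void main(String[] args) {
        // daily user counts for 1114 down to 1102, copied from the table above
        long[] users = {6539125, 6395081, 6668126, 6142667, 5650577, 6259603, 6034332,
                        6399206, 6372263, 6288124, 6279249, 6238395, 5893482};
        System.out.println("mean daily users: " + Math.round(mean(users)));
    }
}
```

The mean comes out at about 6.24 million users per day.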
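The program behind the following figures is not shown in the post. As a rough sketch of the idea, assuming that each user contributes as many clicks as the number of articles they read, one bucket's user count and click share could be computed like this. `ClickShare`, `bucket` and the sample counts are made up for illustration.

```java
public class ClickShare {

    // Returns {users, clicks} for the users whose read count lies in [low, high].
    // Assumption: a user's clicks equal the number of articles they read.
    static long[] bucket(int[] readCounts, int low, int high) {
        long users = 0;
        long clicks = 0;
        for (int n : readCounts) {
            if (n >= low && n <= high) {
                users++;
                clicks += n;
            }
        }
        return new long[] {users, clicks};
    }

    public static void main(String[] args) {
        int[] readCounts = {3, 1, 250, 12, 7, 120}; // made-up per-user read counts
        long total = 0;
        for (int n : readCounts) {
            total += n;
        }
        System.out.println("total click is " + total);

        long[] b = bucket(readCounts, 100, 199);
        System.out.println(">=100 is " + b[0] + " and percent is " + (double) b[1] / total);
    }
}
```

With this definition, a bucket containing very few users can still hold a noticeable share of the clicks. Applied to the full November 14 data, this kind of bucketing gives the figures below.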

total click is 59749234
>1000 is 36 and percent is 0.0443303256406601
>=900 is 8 and percent is 1.28185743770372E-4
>=800 is 9 and percent is 1.2842005639771047E-4
>=700 is 15 and percent is 1.881195665203005E-4
>=600 is 26 and percent is 2.8095757679504307E-4
>=500 is 39 and percent is 3.55067313498948E-4
>=400 is 128 and percent is 9.214678802409417E-4
>=300 is 235 and percent is 0.001360285221397148
>=200 is 811 and percent is 0.003203873709912331
>=100 is 7805 and percent is 0.016746959467296266
<100 is 6530013 and percent is 0.9323563378235108

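The total number of clicks is close to 60 million, which is impressive indeed.

As before, most of the clicks (about 93%) come from users who read fewer than 100 articles, so we break that group down further: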

total click is 55734477
=100 is 269 and percent is 4.8264559834301486E-4
>=90 is 12177 and percent is 0.0776632568024277
>=80 is 5312 and percent is 0.00802119305793432
>=70 is 8721 and percent is 0.011583189342567976
>=60 is 16171 and percent is 0.018557274700900128
>=50 is 31013 and percent is 0.0299886370154689
>=40 is 67823 and percent is 0.053355107288438355
>=30 is 159421 and percent is 0.09656979467125887
>=20 is 419290 and percent is 0.17809098666163137
>=10 is 1234776 and percent is 0.300725366096106
<10 is 4584152 and percent is 0.29699618424696084

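Users who read between 0 and 20 articles account for the largest share of clicks, close to 60%, so they can serve as the typical users for further analysis. To narrow the scope, one could analyze only the users in the 10-20 range. One caveat: the >=90 row looks inflated. Its user count (12177) is higher than that of the >=80 row, and the percent column adds up to about 1.07 rather than 1, which suggests that users with more than 100 articles were also counted in that row.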
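A simple consistency check for tables like the two above is to add up the percent column: if every bucket is a share of the same total and the buckets do not overlap, the sum should be 1. The sketch below applies that check to both tables; `ShareCheck` and `sum` are illustrative names, and the numbers are copied from the outputs above.

```java
public class ShareCheck {

    // Adds up all shares in the array.
    static double sum(double[] shares) {
        double total = 0;
        for (double s : shares) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        // percent column of the first table (>1000 down to <100)
        double[] byHundreds = {0.0443303256406601, 1.28185743770372E-4, 1.2842005639771047E-4,
                1.881195665203005E-4, 2.8095757679504307E-4, 3.55067313498948E-4,
                9.214678802409417E-4, 0.001360285221397148, 0.003203873709912331,
                0.016746959467296266, 0.9323563378235108};
        // percent column of the second table (=100 down to <10)
        double[] byTens = {4.8264559834301486E-4, 0.0776632568024277, 0.00802119305793432,
                0.011583189342567976, 0.018557274700900128, 0.0299886370154689,
                0.053355107288438355, 0.09656979467125887, 0.17809098666163137,
                0.300725366096106, 0.29699618424696084};

        System.out.println("first table sums to " + sum(byHundreds));
        System.out.println("second table sums to " + sum(byTens));
    }
}
```

The first table sums to 1 within rounding, while the second sums to about 1.07. The excess of roughly 0.072 matches the clicks of the users above 100 articles (59749234 - 55734477 = 4014757 clicks) divided by 55734477, which is consistent with those users having been counted in one of the rows of the second table.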
