GitHub repository for this project: https://github.com/wangqifan/ZhiHu
UserManage is the crawler module that fetches user information.
public class UserManage
{
    private string html;
    private string url_token;
}
Constructor
A user's home page URL has the form "https://www.zhihu.com/people/" + url_token + "/following".
public UserManage(string urltoken)
{
    url_token = urltoken;
}
First, wrap the HTML page download in a method.
private bool GetHtml()
{
    string url = "https://www.zhihu.com/people/" + url_token + "/following";
    html = HttpHelp.DownLoadString(url);
    return !string.IsNullOrEmpty(html);
}
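HttpHelp.DownLoadString is not shown in this post. As a rough idea of what such a helper could look like, here is a minimal sketch built on HttpClient. The class name HttpHelpSketch, the BuildFollowingUrl helper, the User-Agent string and the 10-second timeout are all assumptions made for illustration; the project's real helper may differ.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical stand-in for the HttpHelp class used above.
public static class HttpHelpSketch
{
    // One shared client for the whole process, reused by every request.
    private static readonly HttpClient Client = CreateClient();

    private static HttpClient CreateClient()
    {
        var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
        // Assumed header value; sites often reject requests without a User-Agent.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; crawler-sketch)");
        return client;
    }

    // Builds the home page URL in the format described above.
    public static string BuildFollowingUrl(string urlToken)
    {
        return "https://www.zhihu.com/people/" + Uri.EscapeDataString(urlToken) + "/following";
    }

    // Returns the response body, or null when the request fails or times out,
    // so callers can keep using string.IsNullOrEmpty as their success check.
    public static string DownLoadString(string url)
    {
        try
        {
            return Client.GetStringAsync(url).GetAwaiter().GetResult();
        }
        catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
        {
            Console.WriteLine(ex.Message);
            return null;
        }
    }
}
```

Returning null on failure rather than throwing keeps GetHtml() a simple true/false check.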
With the HTML page in hand, the next step is to extract the JSON embedded in the page, using HtmlAgilityPack.
public void analyse()
{
    if (GetHtml())
    {
        try
        {
            Stopwatch watch = new Stopwatch();
            watch.Start();
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            HtmlNode node = doc.GetElementbyId("data");
            StringBuilder stringbuilder = new StringBuilder(node.GetAttributeValue("data-state", ""));
            // Undo the HTML entity escaping of the attribute value
            stringbuilder.Replace("&quot;", "'");
            stringbuilder.Replace("&lt;", "<");
            stringbuilder.Replace("&gt;", ">");
            watch.Stop();
            Console.WriteLine("Parsing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    }
}
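The three Replace calls undo HTML entity escaping by hand. As an alternative sketch (not what the original code does), the framework's WebUtility.HtmlDecode handles these entities and the rest of the named and numeric ones in one call. The class name DataStateDecoder is made up for this example. Note one difference: HtmlDecode turns &quot; into a real double quote, whereas the code above substitutes a single quote.

```csharp
using System;
using System.Net;

// Hypothetical helper: decodes the raw data-state attribute value.
public static class DataStateDecoder
{
    public static string Decode(string rawAttribute)
    {
        // HtmlDecode converts &quot; &lt; &gt; &amp; and numeric entities;
        // a null input is treated as an empty string.
        return WebUtility.HtmlDecode(rawAttribute ?? string.Empty);
    }
}
```

For example, DataStateDecoder.Decode("{&quot;a&quot;:1}") yields standard double-quoted JSON, which JObject.Parse accepts just as it accepts the single-quoted form.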
Add the links to the user's follow lists (followers and followees) to the queue.
private void GetUserFlowerandNext(string json)
{
    string foollowed = "https://www.zhihu.com/api/v4/members/" + url_token + "/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20";
    string following = "https://www.zhihu.com/api/v4/members/" + url_token + "/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0";
    RedisCore.PushIntoList(1, "nexturl", following);
    RedisCore.PushIntoList(1, "nexturl", foollowed);
}
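The two hard-coded URLs differ only in the relation (followers or followees) and in how the include parameter was percent-encoded. A small sketch of building them from the readable form of that parameter follows. FollowApiUrl and Build are names invented for this example, and the decoded include string was recovered by un-escaping the URLs above. On current .NET, Uri.EscapeDataString should produce the same %5B%2A%5D style of encoding as the followees URL.

```csharp
using System;

// Hypothetical helper that assembles the follow-list API URL.
public static class FollowApiUrl
{
    // Readable form of the include parameter, before percent-encoding.
    private const string Include =
        "data[*].answer_count,articles_count,follower_count,is_followed,is_following," +
        "badge[?(type=best_answerer)].topics";

    // relation is expected to be "followers" or "followees".
    public static string Build(string urlToken, string relation, int offset = 0, int limit = 20)
    {
        return "https://www.zhihu.com/api/v4/members/" + Uri.EscapeDataString(urlToken)
             + "/" + relation
             + "?include=" + Uri.EscapeDataString(Include)
             + "&limit=" + limit
             + "&offset=" + offset;
    }
}
```

Keeping the parameter in its readable form makes it easier to see which fields are requested and to change offset and limit without editing a long literal.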
Narrow the JSON down further, keeping only the user's information, with the help of the JSON parsing library Newtonsoft.Json.
private void GetUserInformation(string json)
{
    JObject obj = JObject.Parse(json);
    string xpath = "['" + url_token + "']";
    JToken tocken = obj.SelectToken("['entities']").SelectToken("['users']").SelectToken(xpath);
    RedisCore.PushIntoList(2, "User", tocken.ToString());
}
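To make the lookup concrete, here is a minimal sketch that walks entities, then users, then the url_token on a small hand-written document. It needs the Newtonsoft.Json NuGet package. UserExtractor is a hypothetical name, and the fields in the sample JSON ("name", "followerCount") are invented for illustration, not taken from Zhihu's real payload. The sketch uses indexers instead of SelectToken so that a url_token containing a quote or other special character needs no path escaping.

```csharp
using System;
using Newtonsoft.Json.Linq;

// Hypothetical helper mirroring the lookup done in GetUserInformation.
public static class UserExtractor
{
    // Returns the user's JSON node, or null when any level is missing.
    public static JToken ExtractUser(string json, string urlToken)
    {
        JObject root = JObject.Parse(json);
        return root["entities"]?["users"]?[urlToken];
    }
}

public static class UserExtractorDemo
{
    public static void Main()
    {
        // Single-quoted JSON, as produced by the Replace calls in analyse().
        string sample = "{'entities':{'users':{'some-user':{'name':'Alice','followerCount':3}}}}";
        JToken user = UserExtractor.ExtractUser(sample, "some-user");
        Console.WriteLine(user == null ? "not found" : user.ToString());
    }
}
```

Checking the result for null before calling ToString() avoids a NullReferenceException when the page contains no entry for the requested user.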
Now let's complete the analyse function.
public void analyse()
{
    if (GetHtml())
    {
        try
        {
            Stopwatch watch = new Stopwatch();
            watch.Start();
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            HtmlNode node = doc.GetElementbyId("data");
            StringBuilder stringbuilder = new StringBuilder(node.GetAttributeValue("data-state", ""));
            // Undo the HTML entity escaping of the attribute value
            stringbuilder.Replace("&quot;", "'");
            stringbuilder.Replace("&lt;", "<");
            stringbuilder.Replace("&gt;", ">");
            // Store the user's details, then queue the follow-list URLs
            GetUserInformation(stringbuilder.ToString());
            GetUserFlowerandNext(stringbuilder.ToString());
            watch.Stop();
            Console.WriteLine("Parsing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    }
}
}
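UrlTask takes a follow-list URL from the nexturl queue and fetches that list. The server responds with JSON data.
Wrap object serialization and deserialization in a helper class.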
public class SerializeHelper
{
    /// <summary>
    /// Serialize an object to a JSON string.
    /// </summary>
    public static string SerializeToString(object value)
    {
        return JsonConvert.SerializeObject(value);
    }

    /// <summary>
    /// Deserialize a JSON string into an object of type T.
    /// </summary>
    public static T DeserializeToObject<T>(string str)
    {
        return JsonConvert.DeserializeObject<T>(str);
    }
}
Define the UrlTask class.
public class UrlTask
{
    private string url { get; set; }
    private string JSONstring { get; set; }

    public UrlTask(string _url)
    {
        url = _url;
    }
}
Add a method that fetches the resource.
private bool GetHtml()
{
    JSONstring = HttpHelp.DownLoadString(url);
    Console.WriteLine("JSON download finished");
    return !string.IsNullOrEmpty(JSONstring);
}
The method that parses the JSON
public void Analyse()
{
    try
    {
        if (GetHtml())
        {
            Stopwatch watch = new Stopwatch();
            watch.Start();
            followerResult result = SerializeHelper.DeserializeToObject<followerResult>(JSONstring);
            // Not the last page: queue the next page of this follow list
            if (!result.paging.is_end)
            {
                RedisCore.PushIntoList(1, "nexturl", result.paging.next);
            }
            foreach (var item in result.data)
            {
                int type = Math.Abs(item.GetHashCode()) % 3 + 3;
                // Queue the url_token only when the hash insert succeeds
                if (RedisCore.InsetIntoHash(type, "urltokenhash", item.url_token, "exists"))
                {
                    RedisCore.PushIntoList(1, "urltoken", item.url_token);
                }
            }
            watch.Stop();
            Console.WriteLine("Parsing the JSON took {0} ms", watch.ElapsedMilliseconds.ToString());
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
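Explanation: if result.paging.is_end is true, this is the last page of the user's follow list and there is no further page to queue; otherwise its next URL is pushed onto the nexturl queue. The user array that comes back carries incomplete information, so everything except the url_token is discarded; with that ID we visit the user's home page to get the full details.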
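The followerResult type and the Redis-backed duplicate check are not shown in this post. The sketch below models only what Analyse() reads (paging.is_end, paging.next and data[].url_token) and replaces Redis with an in-memory HashSet. All type names here are hypothetical, the sample URL is a placeholder, and the assumption that RedisCore.InsetIntoHash reports whether the token is new is inferred from how its result is used. It needs the Newtonsoft.Json NuGet package.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical shapes: only the members that Analyse() reads are modelled.
public class PagingSketch
{
    public bool is_end { get; set; }
    public string next { get; set; }
}

public class FollowerItemSketch
{
    public string url_token { get; set; }
}

public class FollowerResultSketch
{
    public PagingSketch paging { get; set; }
    public List<FollowerItemSketch> data { get; set; }
}

public static class FollowPageSketch
{
    // In-memory stand-in for the Redis hash used for de-duplication.
    private static readonly HashSet<string> Seen = new HashSet<string>();

    // Returns the url_tokens not seen before; nextUrl is null on the last page.
    public static List<string> Process(string json, out string nextUrl)
    {
        var result = JsonConvert.DeserializeObject<FollowerResultSketch>(json);
        nextUrl = result.paging.is_end ? null : result.paging.next;

        var fresh = new List<string>();
        foreach (var item in result.data)
        {
            // HashSet.Add returns false when the value is already present.
            if (Seen.Add(item.url_token))
            {
                fresh.Add(item.url_token);
            }
        }
        return fresh;
    }
}
```

HashSet is not thread-safe, so this stand-in only suits a single-threaded experiment; the shared Redis hash is what lets several worker threads agree on which users have been seen.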
Putting the modules together
Wrap a method that takes a nexturl from the queue, visits that follow list, and collects more user IDs.
private static void GetNexturl()
{
    string nexturl = RedisCore.PopFromList(1, "nexturl");
    if (!string.IsNullOrEmpty(nexturl))
    {
        UrlTask task = new UrlTask(nexturl);
        task.Analyse();
    }
}
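Wrap a method that loops, taking a user's urltoken from the queue (and running GetNexturl when that queue is empty), visiting the user's home page, and fetching the information.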
private static void GetUser(object data)
{
    while (true)
    {
        string url_token = RedisCore.PopFromList(1, "urltoken");
        Console.WriteLine(url_token);
        if (!string.IsNullOrEmpty(url_token))
        {
            UserManage manage = new UserManage(url_token);
            manage.analyse();
        }
        else
        {
            GetNexturl();
        }
    }
}
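When both queues are empty, the loop above polls Redis continuously. One possible refinement, sketched here with in-memory queues rather than Redis, is to pause briefly when there is nothing to do and to allow a clean stop. WorkerLoopSketch, its Step and Run methods, the 500 ms pause and the cancellation token are all assumptions added for illustration, not part of the original project.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical rework of the GetUser loop using in-memory queues.
public static class WorkerLoopSketch
{
    public enum StepResult { CrawledUser, CrawledList, Idle }

    // One iteration: prefer a user token, fall back to a follow-list URL.
    public static StepResult Step(
        ConcurrentQueue<string> urlTokens,
        ConcurrentQueue<string> nextUrls,
        Action<string> crawlUser,
        Action<string> crawlList)
    {
        if (urlTokens.TryDequeue(out string token))
        {
            crawlUser(token);
            return StepResult.CrawledUser;
        }
        if (nextUrls.TryDequeue(out string url))
        {
            crawlList(url);
            return StepResult.CrawledList;
        }
        return StepResult.Idle;
    }

    // Loops until cancellation is requested.
    public static void Run(
        ConcurrentQueue<string> urlTokens,
        ConcurrentQueue<string> nextUrls,
        Action<string> crawlUser,
        Action<string> crawlList,
        CancellationToken stop)
    {
        while (!stop.IsCancellationRequested)
        {
            if (Step(urlTokens, nextUrls, crawlUser, crawlList) == StepResult.Idle)
            {
                // Nothing to do: wait briefly instead of spinning on empty queues.
                stop.WaitHandle.WaitOne(TimeSpan.FromMilliseconds(500));
            }
        }
    }
}
```

Splitting one iteration into Step keeps the priority rule (users first, follow lists second) easy to test without starting any threads.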
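Run these methods from the main function. Because the workload is large, use multiple threads; choose the number of threads to suit your situation.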
for (int i = 0; i < 10; i++)
{
ThreadPool.QueueUserWorkItem(GetUser);
}
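One detail worth keeping in mind: ThreadPool.QueueUserWorkItem returns immediately and thread-pool threads are background threads, so the process ends as soon as Main returns. Since GetUser loops forever, blocking Main (for example with Console.ReadLine()) is enough for the crawler itself. The sketch below shows the more general pattern for work items that do finish, using a CountdownEvent to wait for them; PoolSketch and RunWorkers are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical helper: queue N work items and wait until all have finished.
public static class PoolSketch
{
    // Returns how many work items completed without throwing.
    public static int RunWorkers(int workerCount, Action<int> work)
    {
        int finished = 0;
        using (var done = new CountdownEvent(workerCount))
        {
            for (int i = 0; i < workerCount; i++)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        work((int)state);
                        Interlocked.Increment(ref finished);
                    }
                    catch (Exception ex)
                    {
                        // An unhandled exception on a pool thread would end the process.
                        Console.WriteLine(ex.Message);
                    }
                    finally
                    {
                        // Always signal, even if the work item failed.
                        done.Signal();
                    }
                }, i);
            }
            // Block the calling thread until every work item has signalled.
            done.Wait();
        }
        return finished;
    }
}
```

Calling PoolSketch.RunWorkers(10, i => Console.WriteLine("worker " + i)) from Main would keep the process alive until all ten items have run.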
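Add seed data: when the crawler first starts, every queue is empty, so seed data has to be added.
Either add it manually by typing commands in redis-cli.exe,
or add the following to the main function:
UserManage task = new UserManage("some user's url_token");
task.analyse();
After running it once, comment it out so the seed is not added again.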