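Logging in to GitHub with a username and password in C#
I recently planned to write a crawler in C# for GitHub and Gitee, which raised the question of logging in. Third-party OAuth login is possible, and is the better option, but I remembered that a Weibo crawler I wrote earlier logged in with an account and password, so I wanted to do the same here. I intended to follow existing posts, but after a long search I found plenty of Python examples and none in C#, so I decided to write one myself.
How it works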
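Whatever the language, simulating a login from backend code works on the same principle and follows roughly the same steps:
1) Capture the traffic, usually with a tool such as Fiddler, to understand in detail what the browser does.
2) Simulate it. This has two parts. The first is the agent: mimic the way a browser makes requests, including the browser type (UserAgent) and content type (ContentType), matching the browser as closely as possible. The second is cookies: the same CookieContainer must be carried through the whole sequence of requests without a break, so that the server treats them all as one session.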
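As a minimal sketch of these two parts, the helper below builds requests that present the same User-Agent and share one CookieContainer. The class and method names (BrowserLikeRequests, Create) are hypothetical and only for illustration. No network traffic happens here, because nothing is sent until GetResponse is called.

```csharp
using System;
using System.Net;

public static class BrowserLikeRequests
{
    // Hypothetical helper: every request gets the same User-Agent string
    // and the same CookieContainer instance.
    public static HttpWebRequest Create(string url, string method, CookieContainer cookies)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = method;
        req.UserAgent =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
        // With a container assigned, the framework stores response cookies in it
        // and sends matching cookies on later requests that use the same container.
        req.CookieContainer = cookies;
        return req;
    }

    public static void Main()
    {
        var cookies = new CookieContainer();
        HttpWebRequest first = Create("https://github.com/login", "GET", cookies);
        HttpWebRequest second = Create("https://github.com/session", "POST", cookies);
        // Both requests point at the same container object.
        Console.WriteLine(ReferenceEquals(first.CookieContainer, second.CookieContainer));
    }
}
```

Note that HttpWebResponse.Cookies is only populated when the request had a CookieContainer assigned, which is one reason to set it on every request in the sequence.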
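First, analyze the request sequence
Logging in to GitHub involves only two pages:
https://github.com/login and https://github.com/session
Think of the login page as the display page and the session page as the submission page.
The figure above shows the login page: enter https://github.com/login in the browser and capture the traffic with Fiddler.
Note that the response contains an input element named authenticity_token, whose value is needed at login, as shown in the figure below:
That makes the login flow clear:
First, request https://github.com/login with GET, then parse the response and extract the value of the input named authenticity_token.
Then request https://github.com/session with POST, including in the request the parameters shown in the capture above (commit, utf8, authenticity_token, login, password and webauthn-support).
Cookies must of course be preserved throughout.
Next, let's implement this flow: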
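The token extraction step of this flow can be sketched without any third-party package. The example below is an assumption-laden illustration: it uses a regular expression instead of an HTML parser, the class name TokenParser is hypothetical, and the sample HTML is made up to resemble a login form. A regex is more brittle than a real parser, so treat it only as a way to see what is being extracted.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TokenParser
{
    // Returns the value attribute of the first <input name="authenticity_token">,
    // or an empty string when no such input exists.
    public static string ExtractAuthenticityToken(string html)
    {
        // Step 1: locate the whole <input ...> tag by its name attribute.
        Match tag = Regex.Match(
            html,
            "<input\\b[^>]*\\bname=\"authenticity_token\"[^>]*>",
            RegexOptions.IgnoreCase);
        if (!tag.Success) return string.Empty;

        // Step 2: read the value attribute from inside that tag.
        Match value = Regex.Match(tag.Value, "\\bvalue=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        return value.Success ? WebUtility.HtmlDecode(value.Groups[1].Value) : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample markup, not a real GitHub response.
        string html =
            "<form action=\"/session\" method=\"post\">" +
            "<input type=\"hidden\" name=\"authenticity_token\" value=\"abc123==\" />" +
            "</form>";
        Console.WriteLine(ExtractAuthenticityToken(html)); // prints abc123==
    }
}
```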
Implementation (on .NET Framework, version 4.5 or later is recommended)
1) Requesting /login
string url = "https://github.com/login";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.ContentType = "application/x-www-form-urlencoded"; // content type and encoding
req.AllowAutoRedirect = false; // do not follow redirects automatically
// Set the User-Agent to impersonate Google Chrome.
// Be careful not to get this wrong, otherwise the response will contain an extra authenticity_token
req.UserAgent =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
req.Timeout = 50000; // request timeout: 50000 ms (50 seconds)
req.KeepAlive = true; // enable keep-alive connections
req.Method = "GET"; // request method: GET
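Everything above simulates a browser request. Because this is an HTTPS request, the TLS version has to be declared, so add the following at the very start. Only TLS 1.2 is listed: GitHub no longer accepts protocol versions older than TLS 1.2, and SSL 3.0 is insecure.
System.Net.ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
The request is now fully constructed; all that remains is to read the input's value from the response. The HtmlAgilityPack package is used here: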
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
    using (Stream stream = resp.GetResponseStream())
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            string content = reader.ReadToEnd();
            HtmlDocument document = new HtmlDocument();
            document.LoadHtml(content);
            var tokenNodes = document.DocumentNode.SelectNodes("//input[@name='authenticity_token']");
            if (tokenNodes != null && tokenNodes.Count > 0)
            {
                string value = string.Empty;
                result = tokenNodes[0].GetAttributeValue("value", value);
            }
        }
    }
}
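At this point the authenticity_token has been obtained and can be used for the request to the session page. Cookies have not been handled yet; that comes later.
2) Submitting to /session
First, again simulate a browser request: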
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
//req.ContentType = "application/json;charset=UTF-8"; // content type and encoding
req.ContentType = "application/x-www-form-urlencoded";
req.AllowAutoRedirect = true; // follow redirects automatically
// Set the User-Agent to impersonate Google Chrome
req.UserAgent =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
req.Timeout = 50000; // request timeout: 50000 ms (50 seconds)
req.KeepAlive = true; // enable keep-alive connections
req.Method = "POST"; // request method: POST
Next, add the parameters to the request. Every value is URL-encoded, including the password, which would otherwise corrupt the form body if it contained characters such as & or =:
string param = "commit=" + HttpUtility.UrlEncode("Sign in") + "&utf8=" + HttpUtility.UrlEncode("✓").ToUpper() + "&authenticity_token=" + HttpUtility.UrlEncode(token) + "&login=" + HttpUtility.UrlEncode(_userEmail) + "&password=" + HttpUtility.UrlEncode(_pwd) + "&webauthn-support=support";
var byteData = Encoding.UTF8.GetBytes(param);
req.ContentLength = byteData.Length;
using (Stream reqStream = req.GetRequestStream())
{
    reqStream.Write(byteData, 0, byteData.Length);
    reqStream.Flush();
}
Here token is the authenticity_token obtained in the previous step; the username and password need no explanation.
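Building the form body by string concatenation is easy to get wrong, so here is a small sketch of an alternative that encodes every key and value the same way. The class name FormBody and the sample values are hypothetical. It uses Uri.EscapeDataString rather than HttpUtility.UrlEncode, which means a space is written as %20 instead of +; that swap is my assumption for a dependency-free example, not something the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormBody
{
    // Joins key/value pairs into an application/x-www-form-urlencoded body,
    // percent-encoding both keys and values.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(
            f => Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    public static void Main()
    {
        // Illustrative values only; the password deliberately contains
        // characters that would break an unencoded body.
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("commit", "Sign in"),
            new KeyValuePair<string, string>("utf8", "✓"),
            new KeyValuePair<string, string>("password", "p@ss&word=1"),
        };
        Console.WriteLine(Encode(fields));
    }
}
```

Using a list of pairs rather than a dictionary keeps the fields in the order they were added, which makes the output easier to compare with a captured browser request.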
After that, get the response and process it.
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
//result = resp.Cookies.ToString();
}
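That completes the two steps. However, if you follow only this flow, the request to /session is answered with a 422 Unprocessable Entity error, because one step is still missing: setting the cookies.
3) Setting cookies
Setting cookies means adding the cookies from the previous page's response to the CookieContainer of the next request, so that the cookies carry over and the server sees the requests as one session.
Concretely, add the following to the code that handles the /login response: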
// Add every cookie in the response to the cookieContainer
foreach (Cookie cookie in resp.Cookies)
{
    cookieContainer.Add(cookie);
}
// Also load the cookies from the Set-Cookie response header into the cookieContainer
string cookieStr = resp.Headers["set-cookie"];
cookieContainer.SetCookies(new Uri("https://github.com"), cookieStr);
This ensures that all cookies from the /login response end up in our own CookieContainer.
In addition, add the following to the code that builds the /session request:
req.CookieContainer = cookieContainer;
This keeps the two requests consistent.
Here is the complete code:
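To see what the carry-over does without touching the network, the sketch below feeds a made-up Set-Cookie header value into a CookieContainer and then asks the container which Cookie header it would send to /session. The class name CookieCarryOver and the cookie name demo_session are hypothetical and are not GitHub's real cookies.

```csharp
using System;
using System.Net;

public static class CookieCarryOver
{
    // Parses a raw Set-Cookie header value into the container, then returns
    // the Cookie header the container would attach to a request for targetUrl.
    public static string CarryOver(CookieContainer cookies, string setCookieHeader, string targetUrl)
    {
        cookies.SetCookies(new Uri("https://github.com"), setCookieHeader);
        return cookies.GetCookieHeader(new Uri(targetUrl));
    }

    public static void Main()
    {
        var cookies = new CookieContainer();
        // Made-up header value standing in for what /login might return.
        string header = CarryOver(
            cookies,
            "demo_session=abc123; path=/; secure; HttpOnly",
            "https://github.com/session");
        Console.WriteLine(header);
    }
}
```

When the same container is assigned to req.CookieContainer on the /session request, the framework adds this header itself; the explicit GetCookieHeader call is only there to make the effect visible.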
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using Newtonsoft.Json;
using System.Web;
namespace AccaCrawler.Service
{
    public class GitHubLoginHelper
    {
        private string _userEmail;
        private string _pwd;
        public GitHubLoginHelper(string userEmail, string pwd)
        {
            _userEmail = userEmail;
            _pwd = pwd;
        }
        /// <summary>
        /// Handles the login page. A GitHub login needs an authenticity_token, obtained from https://github.com/login, as one of its parameters.
        /// </summary>
        /// <returns>The authenticity_token value, or an empty string if it was not found.</returns>
        public string PreLogin(CookieContainer cookieContainer)
        {
            string url = "https://github.com/login";
            string result = string.Empty;
            // GitHub no longer accepts protocol versions older than TLS 1.2
            System.Net.ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.ContentType = "application/x-www-form-urlencoded"; // content type and encoding
            req.AllowAutoRedirect = false; // do not follow redirects automatically
            // Set the User-Agent to impersonate Google Chrome.
            // Be careful not to get this wrong, otherwise the response will contain an extra authenticity_token
            req.UserAgent =
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
            req.Timeout = 50000; // request timeout: 50000 ms (50 seconds)
            req.KeepAlive = true; // enable keep-alive connections
            req.Method = "GET"; // request method: GET
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                using (Stream stream = resp.GetResponseStream())
                {
                    using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                    {
                        string content = reader.ReadToEnd();
                        HtmlDocument document = new HtmlDocument();
                        document.LoadHtml(content);
                        var tokenNodes = document.DocumentNode.SelectNodes("//input[@name='authenticity_token']");
                        if (tokenNodes != null && tokenNodes.Count > 0)
                        {
                            string value = string.Empty;
                            result = tokenNodes[0].GetAttributeValue("value", value);
                        }
                        // Add every cookie in the response to the cookieContainer
                        foreach (Cookie cookie in resp.Cookies)
                        {
                            cookieContainer.Add(cookie);
                        }
                        // Also load the cookies from the Set-Cookie response header into the cookieContainer
                        string cookieStr = resp.Headers["set-cookie"];
                        cookieContainer.SetCookies(new Uri("https://github.com"), cookieStr);
                    }
                }
            }
            return result;
        }
        /// <summary>
        /// Handles the session page: the actual login, submitted with POST.
        /// </summary>
        /// <param name="token">The authenticity_token returned by PreLogin.</param>
        /// <param name="cookieContainer">The container already holding the cookies from /login.</param>
        /// <returns></returns>
        public string Login(string token, CookieContainer cookieContainer)
        {
            string url = "https://github.com/session";
            string result = string.Empty;
            // GitHub no longer accepts protocol versions older than TLS 1.2
            System.Net.ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            //req.ContentType = "application/json;charset=UTF-8"; // content type and encoding
            req.ContentType = "application/x-www-form-urlencoded";
            req.AllowAutoRedirect = true; // follow redirects automatically
            // Set the User-Agent to impersonate Google Chrome
            req.UserAgent =
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
            req.Timeout = 50000; // request timeout: 50000 ms (50 seconds)
            req.KeepAlive = true; // enable keep-alive connections
            req.Method = "POST"; // request method: POST
            req.CookieContainer = cookieContainer;
            // URL-encode every value, including the password
            string param = "commit=" + HttpUtility.UrlEncode("Sign in") + "&utf8=" + HttpUtility.UrlEncode("✓").ToUpper() + "&authenticity_token=" + HttpUtility.UrlEncode(token) + "&login=" + HttpUtility.UrlEncode(_userEmail) + "&password=" + HttpUtility.UrlEncode(_pwd) + "&webauthn-support=support";
            var byteData = Encoding.UTF8.GetBytes(param);
            req.ContentLength = byteData.Length;
            using (Stream reqStream = req.GetRequestStream())
            {
                reqStream.Write(byteData, 0, byteData.Length);
                reqStream.Flush();
            }
            //var dic = new Dictionary<string, string>
            //{
            //    {"commit", "Sign in"},
            //    {"utf8", "✓"},
            //    {"authenticity_token",token},
            //    { "login",_userEmail},
            //    { "password",_pwd}
            //};
            //var jsonParam = JsonConvert.SerializeObject(new { commit = HttpUtility.UrlEncode("Sign in"), utf8 = HttpUtility.UrlEncode("✓").ToUpper(), authenticity_token = HttpUtility.UrlEncode(token).ToUpper(), login = HttpUtility.UrlEncode(_userEmail), password = HttpUtility.UrlEncode(_pwd) });
            //jsonParam = System.Web.HttpUtility.HtmlEncode(jsonParam);
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                //result = resp.Cookies.ToString();
            }
            return result;
        }
    }
}
Finally, the calling code:
GitHubLoginHelper helper = new GitHubLoginHelper("your account", "your password");
CookieContainer cookieContainer = new CookieContainer();
string token = helper.PreLogin(cookieContainer);
string cookie = helper.Login(token,cookieContainer);