Scraping data from a system that requires login with Python: logging in to a web page with Python to scrape data

I am trying to build a web scraper to extract my stats data from MWO Mercs. To do so, I need to log in to the page and then go through the 6 different stats pages to get the data (this will go into a database later, but that is not my question).

The login form is given below (from https://mwomercs.com/login?return=/profile/stats?type=mech). From what I can see, two fields, EMAIL and PASSWORD, need to be filled in and posted. It should then open http://mwomercs.com/profile/stats?type=mech . After that I need to have a session to cycle through the various stats pages.

I have tried using urllib, mechanize and requests, but I have been unable to get any of them to work. I would prefer to use requests.

I do realise that similar questions have been asked on Stack Overflow, but I have searched for a very long time with no success.

Thank you for any help that could be provided.

LOGIN
MechWarrior Online REGISTER
Email Address:
Password:
LOGIN
[ Forgot Your Password? ]

Solution

The Requests documentation is simple and easy to follow when it comes to submitting form data. Please give the "More Complicated POST requests" section a read-through.

Logins usually come down to saving the cookie and sending it with future requests.

After you POST to the login page with requests.post(), use the response object to retrieve the cookies. This is one way to do it:

import requests

post_headers = {'content-type': 'application/x-www-form-urlencoded'}

# The keys must match the name attributes of the login form's input fields.
payload = {'username': username, 'password': password}

login_request = requests.post(login_url, data=payload, headers=post_headers)

# Cookies set by the login response, as a plain dict.
cookie_dict = login_request.cookies.get_dict()

stats_request = requests.get(stats_url, cookies=cookie_dict)

If you still have problems, check the status code of the response with login_request.status_code, or check the page content for an error message with login_request.text.
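As a rough sketch of that check, the helper below takes the object returned by requests.post() and reports why the login probably failed. The name check_login and the failure_marker text are assumptions made for illustration; the actual error text depends on what the site prints on a failed login.

```python
def check_login(response, failure_marker="Invalid"):
    """Return (ok, reason) for a login response.

    failure_marker is a hypothetical string that the site is assumed to
    show on a failed login; replace it with the real error text.
    """
    # 4xx and 5xx status codes mean the request itself was rejected.
    if response.status_code >= 400:
        return False, f"unexpected status code {response.status_code}"
    # A login page often answers 200 even when the credentials are wrong,
    # so the body has to be inspected as well.
    if failure_marker in response.text:
        return False, f"page contains {failure_marker!r}"
    return True, "no error detected"
```

Calling check_login(login_request) right after the POST gives a quick first diagnosis before you go on to the stats pages.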

Edit:

Some sites will redirect you several times when you make a request. Make sure to check the history attribute of the response to see what happened and why you got bounced out. For example, I get redirects like this all of the time:

>>> some_request.history

(<Response [...]>, <Response [...]>)

Each item in history is the response from an earlier hop in the redirect chain. You can inspect them like normal response objects, for example some_request.history[0].url, and you can disable the redirects by putting allow_redirects=False in your request parameters:

login_request = requests.post(login_url, data=payload, headers=post_headers, allow_redirects=False)
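To see the whole redirect chain at a glance, something like the following could be used. It relies only on the history, status_code, url and headers attributes of a requests response; the helper names describe_redirects and next_location are made up for this example.

```python
def describe_redirects(response):
    """List (status_code, url) for each redirect hop, ending with the final response."""
    hops = [(hop.status_code, hop.url) for hop in response.history]
    hops.append((response.status_code, response.url))
    return hops


def next_location(response):
    """Return the redirect target announced by the server, or None.

    Useful for a response fetched with allow_redirects=False, where the
    redirect is reported in the Location header but not followed.
    """
    return response.headers.get("Location")
```

With allow_redirects=False the history of the response is empty, so describe_redirects(login_request) returns a single entry and next_location(login_request) shows where the server wanted to send you.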

In some cases, I've had to disallow redirects and add new cookies before progressing to the proper page. Try using something like this to keep your existing cookies and add the new cookies to it:

cookie_dict.update(new_request.cookies.get_dict())

Doing this after each request will keep your cookies up-to-date for your next request, similar to how your browser would.
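requests can also do this bookkeeping itself: a requests.Session object stores the cookies from every response and sends them with later requests made through the same session. The sketch below is one possible way to use it, not a tested recipe for this site. The function name fetch_stats_pages, the form field names email and password, and the list of stats URLs are assumptions; check the name attributes in the login form's HTML for the real field names.

```python
def fetch_stats_pages(session, login_url, stats_urls, email, password):
    """Log in once, then fetch each stats page through the same session.

    session is expected to behave like requests.Session: cookies set by the
    login response are kept on it and sent with every later request.
    """
    # Assumed field names; they must match the form's input name attributes.
    payload = {"email": email, "password": password}
    login_response = session.post(login_url, data=payload)
    login_response.raise_for_status()  # raises on 4xx/5xx answers

    pages = {}
    for url in stats_urls:
        response = session.get(url)
        response.raise_for_status()
        pages[url] = response.text  # raw HTML, to be parsed later
    return pages


# Usage (performs real network requests, so it is left commented out):
# import requests
# with requests.Session() as session:
#     pages = fetch_stats_pages(
#         session,
#         "https://mwomercs.com/login?return=/profile/stats?type=mech",
#         ["http://mwomercs.com/profile/stats?type=mech"],
#         "me@example.com",
#         "my-password",
#     )
```

Passing the session in as a parameter keeps the login and the page requests on the same cookie jar, which is what the manual cookie merging above does by hand.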
