PHP simulated Douban login, web crawler - a problem with simulating a Douban login in Python

Semi-automatic simulated login to Douban

Code:

#!/usr/bin/python
#coding:utf-8
__author__ = 'eyu Fanne'

import requests
from bs4 import BeautifulSoup

headers={
    "Host":"www.douban.com",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding":"gzip, deflate",
    "Connection":"keep-alive"
}

s=requests.session()
s.headers.update(headers)

html_url = s.get('https://www.douban.com/accounts/login',headers=headers)
print s.cookies.items()
print "html_url code %s" %html_url.status_code

html_txt = html_url.text
html_soup = BeautifulSoup(html_txt,'lxml')

# Print the captcha image URL so the image can be opened and read by hand.
img_soup = html_soup.find_all('img',class_="captcha_image")
for img_i in img_soup:
    print img_i['src']
    cap_img=img_i['src']

# The hidden captcha-id field has to be posted back together with the answer.
for i in html_soup.find_all("input",attrs={"name":"captcha-id"}):
    print i['value']
    cap_i = i['value']

captcha_solution=raw_input('输入验证码:')  # prompt text: "Enter the captcha:"
captcha_id=cap_i
print captcha_solution
print captcha_id

url_data={
    "source":"index_nav",
    "form_email":"*********",
    "form_password":"*******",
    "captcha-solution":captcha_solution,
    "captcha-id":captcha_id,
}

s_login=s.post(html_url,data=url_data,headers=headers)
print s.cookies.items()

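The account and password are replaced with asterisks. When the script runs it prints the captcha image URL, and I type the captcha in by hand.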
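The manual captcha step depends on two values scraped from the login page: the image URL (shown to the person typing) and the hidden captcha-id (sent back with the answer). As a minimal sketch of that extraction, here is a Python 3 version that uses only the standard library's html.parser instead of BeautifulSoup. The names CaptchaFieldParser, extract_captcha_fields and SAMPLE_HTML are hypothetical, and the sample markup is made up to mirror the two attributes the script looks for; the real page may differ.

```python
from html.parser import HTMLParser


class CaptchaFieldParser(HTMLParser):
    """Collect the captcha image URL and the hidden captcha-id value."""

    def __init__(self):
        super().__init__()
        self.captcha_image = None
        self.captcha_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "img" and "captcha_image" in classes:
            self.captcha_image = attrs.get("src")
        elif tag == "input" and attrs.get("name") == "captcha-id":
            self.captcha_id = attrs.get("value")


def extract_captcha_fields(html_text):
    """Return (image_url, captcha_id); either may be None if not found."""
    parser = CaptchaFieldParser()
    parser.feed(html_text)
    return parser.captcha_image, parser.captcha_id


# Hypothetical markup, assumed for illustration only.
SAMPLE_HTML = """
<form>
  <img class="captcha_image"
       src="https://www.douban.com/misc/captcha?id=SAMPLE:en&amp;size=s">
  <input type="hidden" name="captcha-id" value="SAMPLE:en">
</form>
"""

image_url, captcha_id = extract_captcha_fields(SAMPLE_HTML)
print(image_url)
print(captcha_id)
```

Both values come back as None when the page carries no captcha. The loop-based script would instead fail with a NameError on cap_i in that case, because the variable is only assigned inside the loop.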

Error output:

[('bid', '"X1c3XEWFnhQ"')]
html_url code 200
https://www.douban.com/misc/captcha?id=ArzwwQ6Yv33e0BU7MawrL62d:en&size=s
ArzwwQ6Yv33e0BU7MawrL62d:en
输入验证码:thought
thought
ArzwwQ6Yv33e0BU7MawrL62d:en
Traceback (most recent call last):
  File "D:/360_svn/eyugame_python_exercise/121_remote_pro/crawler_ex/get_douban_move/douban_login.py", line 48, in <module>
    s_login=s.post(html_url,data=url_data,headers=headers)
  File "C:\Python27_x86\lib\site-packages\requests\sessions.py", line 508, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "C:\Python27_x86\lib\site-packages\requests\sessions.py", line 451, in request
    prep = self.prepare_request(req)
  File "C:\Python27_x86\lib\site-packages\requests\sessions.py", line 382, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python27_x86\lib\site-packages\requests\models.py", line 304, in prepare
    self.prepare_url(url, params)
  File "C:\Python27_x86\lib\site-packages\requests\models.py", line 362, in prepare_url
    to_native_string(url, 'utf8')))
requests.exceptions.MissingSchema: Invalid URL '<Response [200]>': No schema supplied. Perhaps you meant http://<Response [200]>?

Process finished with exit code 1

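Where is the problem?

I have one more question, about requests:

http://docs.python-requests.org/en/latest/user/advanced/

s = requests.Session()

The documentation uses the capitalized Session here, but in other places I have seen the lowercase session. What is the difference?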

===========

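Update

The simulated-login problem is solved. The bug was in the final POST request: the first argument I passed was not a URL. I had passed html_url, which is the Response object returned by s.get, rather than the login URL string.

Corrected code: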

#!/usr/bin/python
#coding:utf-8
__author__ = 'eyu Fanne'

import requests
from bs4 import BeautifulSoup

headers={
    "Host":"www.douban.com",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding":"gzip, deflate",
    "Connection":"keep-alive"
}

s=requests.session()
s.headers.update(headers)

# Keep the login URL in its own variable so the same string is used for GET and POST.
login_url=r'https://www.douban.com/accounts/login'
html_url = s.get(login_url,headers=headers)
print s.cookies.items()
print "html_url code %s" %html_url.status_code

html_txt = html_url.text
html_soup = BeautifulSoup(html_txt,'lxml')

# Print the captcha image URL so the image can be opened and read by hand.
img_soup = html_soup.find_all('img',class_="captcha_image")
for img_i in img_soup:
    print img_i['src']
    cap_img=img_i['src']

# The hidden captcha-id field has to be posted back together with the answer.
for i in html_soup.find_all("input",attrs={"name":"captcha-id"}):
    print i['value']
    cap_i = i['value']

captcha_solution=raw_input('输入验证码:')  # prompt text: "Enter the captcha:"
captcha_id=cap_i
print captcha_solution
print captcha_id

url_data={
    "source":"index_nav",
    "form_email":"******",
    "form_password":"******",
    "captcha-solution":captcha_solution,
    "captcha-id":captcha_id,
}

# Post to the URL string, not to the Response object returned by s.get.
s_login=s.post(login_url,data=url_data,headers=headers)
print s.cookies.items()

Output:

[('bid', '"Ojx9+4qSsdw"')]
html_url code 200
https://www.douban.com/misc/captcha?id=ryEmaBD2QermvX2BSPncxIuY:en&size=s
ryEmaBD2QermvX2BSPncxIuY:en
输入验证码:opposite
opposite
ryEmaBD2QermvX2BSPncxIuY:en
[('bid', '"Ojx9+4qSsdw"'), ('ck', '"malX"'), ('dbcl2', '"41572135:JiIAk8PlKLw"'), ('ue', '"896661380@qq.com"')]

Process finished with exit code 0
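The fix comes down to handing post a URL string instead of the Response object returned by get. As a small illustration, the hypothetical helper below (require_url is not part of requests or of the script above) checks that a value is an absolute http(s) URL before it is used, relying only on urllib.parse. It is a Python 3 sketch, and FakeResponse merely stands in for a requests Response so that no network access is needed.

```python
from urllib.parse import urlparse


def require_url(value):
    """Return value unchanged if it is an absolute http(s) URL string."""
    if not isinstance(value, str):
        raise TypeError("expected a URL string, got %s" % type(value).__name__)
    parts = urlparse(value)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("not an absolute http(s) URL: %r" % value)
    return value


class FakeResponse(object):
    """Stand-in for a Response object; only its type and repr matter here."""

    def __repr__(self):
        return "<Response [200]>"


login_url = "https://www.douban.com/accounts/login"
print(require_url(login_url))

try:
    require_url(FakeResponse())
except TypeError as exc:
    print("rejected:", exc)
```

Calling such a check right before the POST would point at the wrong argument directly, rather than leaving the error to surface deep inside the library's URL preparation.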

I still have not figured out the session question at the end, though.
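For reference on the Session versus session question: in the requests source, requests.Session is the class, while requests.session() is a small function kept for backward compatibility that simply returns a new Session instance, so both spellings yield the same kind of object. The check below is a minimal sketch in Python 3 syntax. It assumes the requests package is installed and makes no network calls; the header value is only an example.

```python
import requests

lower = requests.session()   # function call: returns a new Session
upper = requests.Session()   # class instantiated directly

# Both objects are instances of the same class.
same_type = type(lower) is type(upper)
print(same_type)
print(isinstance(lower, requests.Session))

# Either object keeps headers and cookies across the requests it sends.
upper.headers.update({"Accept-Language": "zh-CN"})
print(upper.headers["Accept-Language"])
```

Since the two are interchangeable in practice, following the documentation and writing requests.Session() is a reasonable default.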

