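Firefox with the HttpFox extension lets you inspect the POST form that a page submits.
First, enter your username and password at http://www.renren.com/.
Once submitted, the username and password are POSTed to the following URL: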
http://www.renren.com/PLogin.do
#renren.py
import urllib
import urllib2
import cookielib

# Keep cookies across requests so the session stays logged in
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Form fields submitted by the login page
postdata = urllib.urlencode({
    'email': '',     # your account
    'password': ''   # your password
})
req = urllib2.Request(
    url='http://www.renren.com/PLogin.do',
    data=postdata
)
result = opener.open(req)
print result.read()
At this point you are logged in to Renren.
What gets printed is the HTML source of the logged-in page.
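For readers on Python 3, the same cookie-backed login can be sketched as follows. This is a minimal sketch, not the original script: it assumes the Python 3 standard-library modules urllib.request, urllib.parse and http.cookiejar (the successors of urllib2, urllib and cookielib), and the helper name build_login and the placeholder credentials are hypothetical. It only builds the request, so nothing is sent until opener.open is called.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://www.renren.com/PLogin.do"  # endpoint taken from the text above


def build_login(email, password):
    """Return (opener, request) for the login POST; nothing is sent yet."""
    jar = http.cookiejar.CookieJar()
    # The cookie processor stores cookies set by the server and replays them later
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    body = urllib.parse.urlencode({"email": email, "password": password})
    # Python 3 expects the POST body as bytes
    request = urllib.request.Request(LOGIN_URL, data=body.encode("ascii"))
    return opener, request


if __name__ == "__main__":
    # Placeholder credentials, for illustration only
    opener, request = build_login("user@example.com", "secret")
    print(request.get_method())  # a Request that carries data is sent as POST
    # html = opener.open(request).read()  # uncomment to actually log in
```

Reusing the same opener for later requests is what keeps the session alive, because the cookies live in the jar attached to it.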
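2. Fetching a web page and extracting the information you need
This example uses the stock site Seeking Alpha (SA) (sorry, no offense intended). Open SA and get ready to scrape it.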
import urllib
import urllib2
content=urllib2.urlopen('http://seekingalpha.com/symbol/GOOGL?s=googl').read()
print content
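This prints the page for the GOOGL stock.
*Note: no POST is used here, because the site can be viewed without logging in.
Next, work out the regular expression.
The regular expression is: pattern=re.compile(r'href="/article.*sasource')
It finds every link that points to a commentary (article) page. Printed for GOOG, the results are: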
http://seekingalpha.com/article/2250373-energetic-moves-for-google
http://seekingalpha.com/article/2249173-google-bringing-satellite-internet-to-the-world
http://seekingalpha.com/article/2247383-what-googles-self-driving-car-says-about-the-company
http://seekingalpha.com/article/2238623-europe-tries-to-censor-google
http://seekingalpha.com/article/2236283-google-is-reportedly-mulling-expansion-in-outer-space
http://seekingalpha.com/article/2234863-what-will-googles-30-billion-in-foreign-acquisitions-do
http://seekingalpha.com/article/2229953-in-defense-of-google-glass
http://seekingalpha.com/article/2229163-android-fragmentation-and-the-cloud
http://seekingalpha.com/article/2227963-everything-you-need-to-know-about-twitch-tv-and-why-company-could-be-a-steal-for-google
http://seekingalpha.com/article/2226203-google-adds-quest-visual-to-its-portfolio-m-and-a-overview
http://seekingalpha.com/article/2223103-goog-vs-googl-a-classic-pairs-trade
http://seekingalpha.com/article/2222373-google-or-apple-which-is-the-better-long-term-bet
http://seekingalpha.com/article/2220023-a-look-at-everything-thats-wrong-with-google-glass
http://seekingalpha.com/article/2198683-analysis-of-oral-argument-in-vringo-vs-google-patent-infringement-appeal
http://seekingalpha.com/article/2193673-google-investors-can-expect-upside-potential
http://seekingalpha.com/article/2191843-google-is-a-stock-to-own-for-the-long-term
http://seekingalpha.com/article/2187033-google-7-different-insiders-have-sold-shares-during-the-last-30-days
http://seekingalpha.com/article/2169973-google-facing-some-problems-in-the-mobile-advertising-market
http://seekingalpha.com/article/2168773-google-strikes-deal-with-buffett-backed-wind-generator
http://seekingalpha.com/article/2165243-why-google-has-upside-to-nearly-650
http://seekingalpha.com/article/2251473-what-wwdc-says-about-apples-new-products
http://seekingalpha.com/article/2251063-how-apples-iphones-might-become-an-indispensable-piece-of-equipment-again
http://seekingalpha.com/article/2250973-will-apple-outsmart-google-in-the-internet-of-things
http://seekingalpha.com/article/2249683-demand-medias-c-and-m-business-prospects-boosted-by-new-google-search-algorithm-changes
http://seekingalpha.com/article/2248843-googles-satellites-pose-threat-to-sirius-xm
http://seekingalpha.com/article/2248193-facebook-battling-google-for-eyeballs
http://seekingalpha.com/article/2248143-wall-street-breakfast-must-know-news
http://seekingalpha.com/article/2246013-apple-something-extraordinary-is-certain
http://seekingalpha.com/article/2245693-why-you-shouldnt-believe-the-himax-google-break-up-rumor
http://seekingalpha.com/article/2244133-dividends-role-in-wealth-creation-sector-analysis
http://seekingalpha.com/article/2242083-the-defensive-portfolio-focusing-on-competitive-advantage
http://seekingalpha.com/article/2242023-vringos-q1-report-shows-mixed-results-is-a-secondary-offering-just-around-the-corner
http://seekingalpha.com/article/2241533-is-facebook-at-the-peak-of-its-share-price
http://seekingalpha.com/article/2240663-wall-street-breakfast-must-know-news
http://seekingalpha.com/article/2240493-blackberry-z3-seems-too-late-to-the-party
http://seekingalpha.com/article/2238893-why-apple-beats-partnership-will-change-competitive-landscape-for-music-streaming
http://seekingalpha.com/article/2238073-apples-split-what-you-need-to-know
http://seekingalpha.com/article/2236983-lady-liberty-rescues-vringo-google-royalty-tab-to-exceed-1_8-billion
http://seekingalpha.com/article/2236893-high-time-for-investors-to-buy-into-samsung
http://seekingalpha.com/article/2231733-lenovo-making-the-right-strategic-moves-to-build-value
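The link extraction above can be sketched on its own, without touching the network. This is a minimal sketch under stated assumptions: the sample HTML is made up, the sasource attribute values in it are hypothetical, and the markup shape (a closing quote, whitespace, then sasource) is inferred from the pattern rather than confirmed against the live site. It also swaps the greedy .* for a capture group that stops at the closing quote, so two links on the same line are not merged into one match.

```python
import re

# Capture the path between href=" and the closing quote, and require the
# sasource attribute that follows article links (assumed markup shape).
ARTICLE_LINK = re.compile(r'href="(/article[^"]*)"\s+sasource')


def extract_article_links(html, base="http://seekingalpha.com"):
    """Return absolute URLs for every article link found in html."""
    return [base + path for path in ARTICLE_LINK.findall(html)]


# Hypothetical sample: two article links and one non-article link on one line
SAMPLE_HTML = (
    '<a href="/article/2250373-energetic-moves-for-google" sasource="x">A</a>'
    '<a href="/symbol/GOOG" sasource="y">not an article</a>'
    '<a href="/article/2229953-in-defense-of-google-glass" sasource="z">B</a>'
)

if __name__ == "__main__":
    for link in extract_article_links(SAMPLE_HTML):
        print(link)
```

Because the group captures only the path, no manual slicing of the match is needed afterwards.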
Here is the complete Python code:
#table commenturl
#CREATE TABLE `commenturl` (
#  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
#  `object` varchar(30) DEFAULT NULL,
#  `url` varchar(1024) DEFAULT NULL,
#  PRIMARY KEY (`id`)
# ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
#TRUNCATE TABLE commenturl resets AUTO_INCREMENT to 1
import re
import sys
import urllib2
import MySQLdb

# Send a browser-like User-Agent so the site does not answer 403 Forbidden
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(url='http://seekingalpha.com/symbol/GOOG?s=goog', headers=headers)
content = urllib2.urlopen(req).read()

# Collect the raw matches for article links
links = re.findall(r'href="/article.*sasource', content)

try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306)
    cur = conn.cursor()
    conn.select_db('usr')
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)  # without a connection the inserts below cannot run

for url in links:
    ct = len(url)
    # Drop the leading href=" (6 characters) and the trailing 10 characters that end in sasource
    url = url[6:(ct - 10)]
    url = 'http://seekingalpha.com' + url
    print url
    # Pass the value as a parameter so the driver escapes it
    cur.execute("INSERT INTO commenturl(object,url) VALUES('GOOG',%s)", (url,))
conn.commit()
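The insert-then-commit step can be tried without a MySQL server. The sketch below swaps in sqlite3 from the Python standard library as a stand-in for MySQLdb, which is an assumption made purely so the example runs anywhere; the helper name save_urls is hypothetical, and the table mirrors the commenturl definition only loosely (SQLite column types differ from MySQL's).

```python
import sqlite3


def save_urls(conn, symbol, urls):
    """Insert one row per URL using placeholders, then commit once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commenturl ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, object TEXT, url TEXT)"
    )
    # sqlite3 uses ? placeholders; MySQLdb uses %s for the same purpose
    conn.executemany(
        "INSERT INTO commenturl (object, url) VALUES (?, ?)",
        [(symbol, url) for url in urls],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
save_urls(conn, "GOOG", [
    "http://seekingalpha.com/article/2250373-energetic-moves-for-google",
    "http://seekingalpha.com/article/2229953-in-defense-of-google-glass",
])
rows = conn.execute("SELECT id, object, url FROM commenturl ORDER BY id").fetchall()
```

Passing values through placeholders rather than string formatting leaves quoting and escaping to the database driver, which matters because scraped URLs are untrusted input.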
Note: to keep crawlers out, the site may respond with Error 403 Forbidden. In that case, imitate a browser by sending a User-Agent header: req = urllib2.Request(url='http://seekingalpha.com/symbol/GOOG?s=goog', headers=headers)
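A Python 3 version of the browser-imitating request might look like the sketch below. It assumes the standard-library urllib.request and urllib.error modules; the helper names build_request and fetch are hypothetical, and returning None on a 403 is just one possible way to handle the refusal.

```python
import urllib.error
import urllib.request

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
                  "Gecko/20091201 Firefox/3.5.6"
}


def build_request(url):
    """Attach the browser-like headers; nothing is sent yet."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch(url):
    """Return the page body as bytes, or None if the server answers 403."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 403:
            return None  # still refused even with a browser-like User-Agent
        raise
```

Calling fetch('http://seekingalpha.com/symbol/GOOG?s=goog') would perform the actual download; a site may still refuse the request for other reasons.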
In short, the above is the complete source code together with the MySQL table-creation statement.