Scraping web page content with Python, parsing it with regular expressions, and storing it in a MySQL database

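Firefox with the HttpFox add-on lets you inspect the POST form data.

First, enter your username and password at http://www.renren.com/.

After you submit them, they are POSTed to the following URL:

http://www.renren.com/PLogin.do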

# renren.py
import urllib
import urllib2
import cookielib

# Keep cookies across requests so the session survives the login
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

postdata = urllib.urlencode({
    'email': '',     # your account
    'password': '',  # your password
})

req = urllib2.Request(
    url='http://www.renren.com/PLogin.do',
    data=postdata
)

result = opener.open(req)
print result.read()

At this point you are logged in to Renren.

The output is the HTML source of the logged-in page.
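For readers on Python 3, the same idea can be sketched with the standard library's renamed modules (`urllib.request`, `urllib.parse`, `http.cookiejar`). This is a minimal sketch, not the article's own code: the helper name `build_login` is hypothetical, and the form field names `email` and `password` are taken from the script above. It only builds the request, so nothing is sent until you call `opener.open(request)` yourself.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = 'http://www.renren.com/PLogin.do'  # endpoint named in the article


def build_login(email, password):
    """Return (opener, request) for a cookie-aware login POST (hypothetical helper)."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # In Python 3 the POST body must be bytes, so encode the form string.
    body = urllib.parse.urlencode({'email': email, 'password': password}).encode('ascii')
    request = urllib.request.Request(LOGIN_URL, data=body)
    return opener, request


if __name__ == '__main__':
    opener, request = build_login('', '')  # fill in your own account
    # opener.open(request) would send the POST; it is left out so the sketch has no side effects.
    print(request.get_method(), request.full_url)
```

Because the opener owns the cookie jar, later calls to `opener.open(...)` would reuse whatever session cookies the login response set.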

2. Fetching a page and extracting the information you need

This example uses the stock site Seeking Alpha (no offense intended). Open SA and get ready to scrape it.

import urllib2

# A plain GET is enough here; no login is required
content = urllib2.urlopen('http://seekingalpha.com/symbol/GOOGL?s=googl').read()
print content

This prints the page for the GOOGL stock.

* Note that no POST is used here, because the site can be viewed without logging in.

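Next, work out the regular expression:

pattern = re.compile(r'href="/article.*sasource')

This finds all links that point to commentary pages. Once each match is trimmed and prefixed with the site's domain, the results for GOOG look like this: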

http://seekingalpha.com/article/2250373-energetic-moves-for-google

http://seekingalpha.com/article/2249173-google-bringing-satellite-internet-to-the-world

http://seekingalpha.com/article/2247383-what-googles-self-driving-car-says-about-the-company

http://seekingalpha.com/article/2238623-europe-tries-to-censor-google

http://seekingalpha.com/article/2236283-google-is-reportedly-mulling-expansion-in-outer-space

http://seekingalpha.com/article/2234863-what-will-googles-30-billion-in-foreign-acquisitions-do

http://seekingalpha.com/article/2229953-in-defense-of-google-glass

http://seekingalpha.com/article/2229163-android-fragmentation-and-the-cloud

http://seekingalpha.com/article/2227963-everything-you-need-to-know-about-twitch-tv-and-why-company-could-be-a-steal-for-google

http://seekingalpha.com/article/2226203-google-adds-quest-visual-to-its-portfolio-m-and-a-overview

http://seekingalpha.com/article/2223103-goog-vs-googl-a-classic-pairs-trade

http://seekingalpha.com/article/2222373-google-or-apple-which-is-the-better-long-term-bet

http://seekingalpha.com/article/2220023-a-look-at-everything-thats-wrong-with-google-glass

http://seekingalpha.com/article/2198683-analysis-of-oral-argument-in-vringo-vs-google-patent-infringement-appeal

http://seekingalpha.com/article/2193673-google-investors-can-expect-upside-potential

http://seekingalpha.com/article/2191843-google-is-a-stock-to-own-for-the-long-term

http://seekingalpha.com/article/2187033-google-7-different-insiders-have-sold-shares-during-the-last-30-days

http://seekingalpha.com/article/2169973-google-facing-some-problems-in-the-mobile-advertising-market

http://seekingalpha.com/article/2168773-google-strikes-deal-with-buffett-backed-wind-generator

http://seekingalpha.com/article/2165243-why-google-has-upside-to-nearly-650

http://seekingalpha.com/article/2251473-what-wwdc-says-about-apples-new-products

http://seekingalpha.com/article/2251063-how-apples-iphones-might-become-an-indispensable-piece-of-equipment-again

http://seekingalpha.com/article/2250973-will-apple-outsmart-google-in-the-internet-of-things

http://seekingalpha.com/article/2249683-demand-medias-c-and-m-business-prospects-boosted-by-new-google-search-algorithm-changes

http://seekingalpha.com/article/2248843-googles-satellites-pose-threat-to-sirius-xm

http://seekingalpha.com/article/2248193-facebook-battling-google-for-eyeballs

http://seekingalpha.com/article/2248143-wall-street-breakfast-must-know-news

http://seekingalpha.com/article/2246013-apple-something-extraordinary-is-certain

http://seekingalpha.com/article/2245693-why-you-shouldnt-believe-the-himax-google-break-up-rumor

http://seekingalpha.com/article/2244133-dividends-role-in-wealth-creation-sector-analysis

http://seekingalpha.com/article/2242083-the-defensive-portfolio-focusing-on-competitive-advantage

http://seekingalpha.com/article/2242023-vringos-q1-report-shows-mixed-results-is-a-secondary-offering-just-around-the-corner

http://seekingalpha.com/article/2241533-is-facebook-at-the-peak-of-its-share-price

http://seekingalpha.com/article/2240663-wall-street-breakfast-must-know-news

http://seekingalpha.com/article/2240493-blackberry-z3-seems-too-late-to-the-party

http://seekingalpha.com/article/2238893-why-apple-beats-partnership-will-change-competitive-landscape-for-music-streaming

http://seekingalpha.com/article/2238073-apples-split-what-you-need-to-know

http://seekingalpha.com/article/2236983-lady-liberty-rescues-vringo-google-royalty-tab-to-exceed-1_8-billion

http://seekingalpha.com/article/2236893-high-time-for-investors-to-buy-into-samsung

http://seekingalpha.com/article/2231733-lenovo-making-the-right-strategic-moves-to-build-value
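To make the extraction step concrete, here is a small sketch run against a made-up snippet. The markup in `SAMPLE_HTML` is an assumption (the article does not show the real page source), and `ARTICLE_LINK` and `extract_article_urls` are hypothetical names. It uses a capture group that stops at the closing quote, which avoids slicing the match by hand and also avoids a pitfall of the greedy `.*`: when two links share one line, the greedy pattern swallows both in a single match.

```python
import re

# Hypothetical markup: the real page layout is not shown in the article.
SAMPLE_HTML = (
    '<a href="/article/2250373-energetic-moves-for-google" sasource="x">A</a>'
    '<a href="/article/2249173-google-bringing-satellite-internet-to-the-world" sasource="x">B</a>'
)

# [^"]* cannot cross the closing quote, so each href is captured separately.
ARTICLE_LINK = re.compile(r'href="(/article/[^"]*)"\s+sasource')


def extract_article_urls(html, base='http://seekingalpha.com'):
    """Return absolute article URLs found in html."""
    return [base + path for path in ARTICLE_LINK.findall(html)]


if __name__ == '__main__':
    for url in extract_article_urls(SAMPLE_HTML):
        print(url)
    # The greedy pattern from the article merges both links on this one-line sample.
    print(len(re.findall(r'href="/article.*sasource', SAMPLE_HTML)))
```

If the real page puts every link on its own line, the original pattern behaves the same as this one; the capture group simply makes that independent of the layout.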

Here is the complete Python code:

# Table: commenturl
# CREATE TABLE `commenturl` (
#   `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
#   `object` varchar(30) DEFAULT NULL,
#   `url` varchar(1024) DEFAULT NULL,
#   PRIMARY KEY (`id`)
# ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# TRUNCATE TABLE commenturl;  -- resets AUTO_INCREMENT to 1

import re
import sys
import urllib2

import MySQLdb

# Send a browser-like User-Agent so the site does not reject the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(url='http://seekingalpha.com/symbol/GOOG?s=goog', headers=headers)
content = urllib2.urlopen(req).read()

links = re.findall(r'href="/article.*sasource', content)

try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306)
    cur = conn.cursor()
    conn.select_db('usr')
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)  # without a connection there is nothing to insert into

for url in links:
    # Drop the leading 'href="' (6 characters) and the last 10 characters of the match
    url = url[6:-10]
    url = 'http://seekingalpha.com' + url
    print url
    # Pass the URL as a parameter so the driver escapes it
    cur.execute("INSERT INTO commenturl(object,url) VALUES('GOOG',%s)", (url,))

conn.commit()
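The storage step can be tried without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` module purely as a stand-in, so it is not the article's setup: SQLite uses `?` placeholders where MySQLdb uses `%s`, and the column types are simplified. The function name `store_urls` is hypothetical. What carries over is the pattern of passing values as parameters instead of formatting them into the SQL string.

```python
import sqlite3


def store_urls(conn, symbol, urls):
    """Insert one row per URL into a commenturl table (SQLite stand-in for MySQL)."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS commenturl ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, object TEXT, url TEXT)'
    )
    # Values are passed separately from the SQL text, so the driver handles quoting.
    conn.executemany(
        'INSERT INTO commenturl(object, url) VALUES (?, ?)',
        [(symbol, url) for url in urls],
    )
    conn.commit()


conn = sqlite3.connect(':memory:')  # throwaway in-memory database
store_urls(conn, 'GOOG', ['http://seekingalpha.com/article/2250373-energetic-moves-for-google'])
rows = conn.execute('SELECT id, object, url FROM commenturl').fetchall()
print(rows)
```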

Note: to block crawlers, the site may respond with Error 403 Forbidden. In that case, imitate a browser by sending request headers:

req = urllib2.Request(url='http://seekingalpha.com/symbol/GOOG?s=goog', headers=headers)
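On Python 3 the same workaround might look like the sketch below. It is an illustration under assumptions: `browser_request` and `fetch` are hypothetical helper names, the User-Agent string is the one from the script above, and whether the site still accepts it is not something this sketch can promise.

```python
import urllib.request

BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) '
                  'Gecko/20091201 Firefox/3.5.6',
}


def browser_request(url):
    """Build a GET request that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url, timeout=10):
    """Download url and return its body as text (performs a real network call)."""
    with urllib.request.urlopen(browser_request(url), timeout=timeout) as resp:
        return resp.read().decode('utf-8', errors='replace')
```

Calling `fetch('http://seekingalpha.com/symbol/GOOG?s=goog')` would perform the download; a 403 would then surface as `urllib.error.HTTPError`.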

In short, the above is the complete source code together with the MySQL CREATE TABLE statement.
