Getting Started with Python Web Crawlers

午阿哥

Category: python. Tags: python, idea, crawler

Article link: https://blog.csdn.net/sinat_27912569/article/details/79878777

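First, a note: learning something new goes better with a quick sense of accomplishment. If you already know another programming language, or have at least a passing familiarity with one, you can start writing code right away and learn whatever you get stuck on as you go. Get a basic example working that produces a visible result first, then dig deeper afterwards.
Environment: IDEA editor, Python 3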

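1. Straight to the code

This is a beginner-level crawler whose printed result you can see immediately. It is the basic code for opening a URL and fetching the page's HTML source.

Everybody copies from everybody when writing, and code is no different: copy first and enjoy the quick win.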

from urllib import request

if __name__ == "__main__":
    # Open the URL, read the raw bytes, and decode them as UTF-8 text
    with request.urlopen("https://www.cnblogs.com/panzi/p/6421826.html") as response:
        html = response.read().decode("utf-8")
    print(html)
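The example above hard-codes UTF-8, which fails on pages served in another encoding such as GBK. Here is a minimal sketch of one way to handle that: take the charset from the response's Content-Type header and fall back to UTF-8 when none is declared. The helper names `decode_body` and `fetch_text` and the 10-second timeout are assumptions made for illustration; they are not part of the original post.

```python
from email.message import Message
from urllib import request


def decode_body(body, headers, default="utf-8"):
    """Decode raw bytes using the charset declared in the headers, if any."""
    # get_content_charset() returns None when no charset is declared
    charset = headers.get_content_charset() or default
    # errors="replace" keeps the crawler running on the odd bad byte
    return body.decode(charset, errors="replace")


def fetch_text(url, timeout=10.0):
    """Hypothetical helper: download a page and decode it by its declared charset."""
    with request.urlopen(url, timeout=timeout) as response:
        return decode_body(response.read(), response.headers)


if __name__ == "__main__":
    # Offline demonstration with a hand-built header, so no network is needed
    headers = Message()
    headers["Content-Type"] = "text/html; charset=gbk"
    print(decode_body("中文页面".encode("gbk"), headers))
```

Some pages declare their encoding only in a `<meta>` tag rather than in the header; this sketch does not cover that case.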

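2. After that you will probably need various Python frameworks and basic plugins; learn them as you go.

3. Next, collect the links in a page and open each one in a loop. Keyword: re (regular expressions)

4. Write the information you need to a file. Keyword: file writing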
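Steps 3 and 4 can be sketched on a small in-memory HTML snippet before touching a real site. This is only an illustration: the function names `extract_links` and `append_links`, the sample HTML, and the tab-separated output format are assumptions. The pattern has the same shape as the one used in the full script in this post.

```python
import re
import tempfile
from pathlib import Path
from urllib.parse import urljoin

# Matches list-item links such as <li><a href="/x.html">Name</a></li>
LINK_PATTERN = re.compile(r'<li><a href="(.*?)">(.*?)</a></li>')


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs found in the HTML string."""
    return [(urljoin(base_url, href), text)
            for href, text in LINK_PATTERN.findall(html)]


def append_links(path, links):
    """Append one 'text<TAB>url' line per link, encoded as UTF-8."""
    with path.open("a", encoding="utf-8") as f:
        for url, text in links:
            f.write(f"{text}\t{url}\n")


if __name__ == "__main__":
    # Made-up sample page; a real crawler would download this HTML instead
    sample_html = (
        "<ul>"
        '<li><a href="/python3/python3-tutorial.html">Python 3</a></li>'
        '<li><a href="/html/html-tutorial.html">HTML</a></li>'
        "</ul>"
    )
    links = extract_links(sample_html, "http://www.runoob.com")
    # Write into a temporary directory so the demo leaves nothing behind
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "links.txt"
        append_links(out, links)
        print(out.read_text(encoding="utf-8"))
```

Regular expressions are enough for a quick first crawler, but they break easily when the page markup changes, so treat the pattern as tied to one specific page layout.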

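For how to install and import the modules you need, search online (for example, on Baidu). Note that urllib and re ship with Python, while pymysql is a third-party package that has to be installed separately.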

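# Scrape each technology page on the runoob (菜鸟) site, plus its sub-pages and their descriptions
# Upgrade: write the results to a file
# Upgrade: write the results to a database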

from urllib import request
import re
import pymysql


# Database access: insert one (title1, url, description) row
def testMysql(value1):
    conn = pymysql.connect(host='localhost', port=3306, user='root', password='123', database='test', charset='utf8')
    try:
        with conn.cursor() as cu:
            cu.execute('insert into runoob2(title1,url,description) values(%s,%s,%s)', value1)
        conn.commit()
    except Exception as e:
        # Undo the failed insert and report the error
        conn.rollback()
        print(e)
    finally:
        # Always release the connection, even when the insert fails
        conn.close()


# Open the URL and return the page source as text
def openurl(url):
    with request.urlopen(url) as response:
        return response.read().decode("utf-8")


# Navigation entries that are not tutorial links (these match the page's own Chinese link text)
SKIP_NAMES = {'首页', '更多……', '用户登录', '注册新用户'}


if __name__ == "__main__":
    # Top-level technology links on the tutorial page
    reg = r'<li><a href="(.*?)">(.*?)</a></li>'
    urlList = re.findall(reg, openurl("http://www.runoob.com/python3/python3-tutorial.html"))
    for iurl, iname in urlList:
        if iname in SKIP_NAMES:
            print('Skipping url: ' + iurl + iname)
        else:
            htmlp = "http://www.runoob.com" + iurl
            print('Technology tag: ' + iname + ', url: ' + htmlp)
            # Links to the sub-pages listed on each technology page
            reg1 = r'<a target="_top" title="(.*?)" href="/(.*?)\.html">'
            urlList1 = re.findall(reg1, openurl(htmlp))
            for urlname, urlson in urlList1:
                htmlson = 'http://www.runoob.com/' + urlson + '.html'
                print(urlname + '--' + htmlson)
                # The description comes from the sub-page's <meta name="description"> tag
                ex = r'<meta name="description" content="(.*?)">'
                description = ''.join(re.findall(ex, openurl(htmlson)))
                print(description)
                # Writing to a file instead (this worked):
                # with open('D:\\url.txt', mode='a+', encoding='utf-8') as f:
                #     f.write(urlname + '\n')
                #     f.write(htmlson + '\n')
                #     f.write(description + '\n')
                value1 = [urlname, htmlson, description]
                testMysql(value1)
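The script above opens a new database connection for every row. A gentler approach is to collect the rows and insert them in one transaction over a single connection. The sketch below shows the idea with Python's built-in sqlite3 module as a stand-in, so it runs without a MySQL server; it is not the pymysql code from this post. The table definition is hypothetical (the post only shows the column names used in its INSERT statement), and the function name `save_rows` is made up for illustration. Note that sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

# Hypothetical schema: the post never shows the real table definition,
# only the column names used in its INSERT statement.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS runoob2 (
    title1 TEXT,
    url TEXT,
    description TEXT
)
"""


def save_rows(conn, rows):
    """Insert all rows in one transaction and return how many were stored."""
    rows = list(rows)
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO runoob2 (title1, url, description) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(CREATE_TABLE)
    # Made-up sample row standing in for scraped data
    sample = [("Python 3 tutorial",
               "http://www.runoob.com/python3/python3-tutorial.html",
               "sample description")]
    print(save_rows(conn, sample))
    conn.close()
```

With pymysql the same shape applies: create the connection once before the crawl loop, accumulate `[title, url, description]` rows, then call the cursor's `executemany` and commit once at the end.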