Scraping the Youdict (优词词典) dictionary with Python

Using Python to scrape the Youdict dictionary and build an index

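Prerequisites:

1. Basic Python

2. Basic networking knowledge

3. How web crawlers work

4. How to use the requests module

5. How to use the BeautifulSoup module

6. Basic database knowledge

7. How to use pymysql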

 

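I won't go over installing Python or using pip to install requests, pymysql, and BeautifulSoup here; there are plenty of tutorials online (at this stage, searching Baidu is the way to go).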

 


With the points above covered, we can start writing the crawler.

 

1. Define the target. Target site: http://www.youdict.com/ciku/

 

Target elements: the word (English and Chinese), the word link, and the image link

 

 

2. Write the code that fetches a page and extracts the elements:

 

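import requests
from bs4 import BeautifulSoup

# transurl() and insert() are defined in the complete script in step 5.
# The URL is kept on one line: a backslash continuation inside the string
# would embed the next line's leading spaces in the URL.
newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_0.html'
res = requests.get(newsurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Each word card sits in an element with the class "col-sm-6".
divs = soup.select(".col-sm-6")
for each_div in divs:
    english = each_div.div.div.h3.a.text
    imgurl = transurl(each_div.div.img['src'])
    chinese = each_div.div.p.text
    insert(english, chinese, imgurl)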
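The snippet above relies on a transurl helper to turn the image's src attribute into a full URL. As a minimal sketch of the same idea, assuming the image paths are relative to the site root, the standard library's urljoin can do the joining; the function name to_absolute and the sample image path are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: image paths on the page are relative to the site root.
BASE_URL = "http://www.youdict.com/"


def to_absolute(src):
    """Return an absolute URL for a possibly relative image path."""
    # strip() drops stray whitespace or newlines around the attribute value.
    return urljoin(BASE_URL, src.strip())


# Hypothetical image path, used only for illustration.
print(to_absolute("/images/words/apple.jpg"))
```

Unlike plain string concatenation, urljoin leaves a src that is already an absolute URL unchanged.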

 

3. Build the URL for each page from the pagination pattern:

newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'

Here i is the page index, supplied by the loop.
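The pagination rule can be sketched as a small generator that produces one URL per page index. The names URL_TEMPLATE and page_urls are hypothetical; the URL pattern is the one shown above.

```python
# URL pattern from the article; {page} is the page index.
URL_TEMPLATE = "http://www.youdict.com/ciku/id_5_0_0_0_{page}.html"


def page_urls(start, end):
    """Yield list-page URLs for page indices start .. end-1."""
    for i in range(start, end):
        yield URL_TEMPLATE.format(page=i)


urls = list(page_urls(0, 3))
for u in urls:
    print(u)
```

Keeping the pattern in a single template string avoids stray characters slipping into the URL when it is split across lines.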

 


 

4. Connect to the database:

def insert(english, chinese, imgurl):
    # Replace the placeholder password and database name with your own.
    # Keyword arguments are used because recent PyMySQL releases no longer
    # accept positional connection arguments.
    db = pymysql.connect(host="localhost", user="root",
                         password="your db pass", database="your db name")
    cursor = db.cursor()
    # Pass the values as parameters so the driver escapes them, instead of
    # calling escape_string and concatenating them into the SQL text.
    sql = ("insert into reaserchwords(english, chinese, imgurl) "
           "values (%s, %s, %s)")
    cursor.execute(sql, (english, chinese, imgurl))
    db.commit()
    db.close()
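Passing values as query parameters, rather than building the SQL string by hand, lets the driver handle quoting. The sketch below illustrates this with the standard library's sqlite3 module as a stand-in for MySQL, so it runs without a database server. The column types and the sample row are assumptions; note that sqlite3 uses ? placeholders where PyMySQL uses %s.

```python
import sqlite3

# Stand-in for the MySQL database: an in-memory SQLite database.
# The column types are assumed; the article does not show the schema.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE reaserchwords (english TEXT, chinese TEXT, imgurl TEXT)"
)


def insert(db, english, chinese, imgurl):
    """Insert one word using query parameters instead of string concatenation."""
    db.execute(
        "INSERT INTO reaserchwords (english, chinese, imgurl) VALUES (?, ?, ?)",
        (english, chinese, imgurl),
    )
    db.commit()


# A value containing a quote character is stored as-is (hypothetical sample row).
insert(db, "o'clock", "点钟", "http://www.youdict.com/images/oclock.jpg")
rows = db.execute("SELECT english, chinese FROM reaserchwords").fetchall()
print(rows)
```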

 


 

5. Put everything together into the complete crawler:

 

# coding=utf-8
'''
Created on 2018.8.18
@author: ZEC---
'''

import requests
import pymysql
from bs4 import BeautifulSoup


def insert(english, chinese, imgurl):
    # Replace the placeholder password and database name with your own.
    db = pymysql.connect(host="localhost", user="root",
                         password="your db pass", database="your db name")
    cursor = db.cursor()
    # Pass the values as parameters so the driver escapes them.
    sql = ("insert into reaserchwords(english, chinese, imgurl) "
           "values (%s, %s, %s)")
    cursor.execute(sql, (english, chinese, imgurl))
    db.commit()
    db.close()


def transurl(url):
    # Turn a site-relative path into an absolute URL. strip() returns a new
    # string, so its result has to be used rather than discarded.
    return "http://www.youdict.com" + url.strip()


def main_thread(start, end):
    # Crawl the list pages with indices start .. end-1.
    for i in range(start, end):
        newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        # Each word card sits in an element with the class "col-sm-6".
        divs = soup.select(".col-sm-6")
        for each_div in divs:
            english = each_div.div.div.h3.a.text
            imgurl = transurl(each_div.div.img['src'])
            chinese = each_div.div.p.text
            insert(english, chinese, imgurl)
        print("page " + str(i + 1) + " is ok")


main_thread(67, 274)

 


 

The word search page I built is shown in the figure:

Search demo URL: www.senlear.com/words
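The article does not show how the search page queries the scraped data. As a rough sketch under stated assumptions, a word lookup could create an index on the english column and match by prefix. The example uses sqlite3 as a stand-in for MySQL; the index name idx_english, the search function, and the sample rows are all hypothetical.

```python
import sqlite3

# Stand-in for the MySQL database: an in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE reaserchwords (english TEXT, chinese TEXT, imgurl TEXT)"
)
# An index on the lookup column; the index name is hypothetical.
db.execute("CREATE INDEX idx_english ON reaserchwords (english)")

# Hypothetical sample rows, for illustration only.
db.executemany(
    "INSERT INTO reaserchwords VALUES (?, ?, ?)",
    [("apple", "苹果", ""), ("apply", "申请", ""), ("banana", "香蕉", "")],
)


def search(db, prefix):
    """Return (english, chinese) pairs whose English word starts with prefix."""
    cur = db.execute(
        "SELECT english, chinese FROM reaserchwords "
        "WHERE english LIKE ? ORDER BY english",
        (prefix + "%",),
    )
    return cur.fetchall()


results = search(db, "app")
print(results)
```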

Original article: http://www.encai.online/Info/index/id/72.html
