Guide to Installing Scrapy on Ubuntu 14.04

How to Install Scrapy, a Web Crawling Tool, in Ubuntu 14.04 LTS


Original article: http://linoxide.com/ubuntu-how-to/scrapy-install-ubuntu/?utm_source=tuicool

January 7, 2015 | By nido in OPEN SOURCE TOOLS, UBUNTU HOWTO

Scrapy is open source software used for extracting data from websites. The Scrapy framework is developed in Python and performs crawling in a fast, simple and extensible way. For this guide, we created a virtual machine (VM) in VirtualBox and installed Ubuntu 14.04 LTS on it.

Install Scrapy

Scrapy depends on Python, the Python development libraries and pip. Python is pre-installed on Ubuntu, so we only have to install pip and the Python development libraries before installing Scrapy.

Pip is the replacement for easy_install and fetches packages from the Python Package Index. It is used for the installation and management of Python packages. The installation of the pip package is shown in Figure 1.

sudo apt-get install python-pip

Fig. 1: Pip installation

We install the Python development libraries with the following command. If this package is not installed, the installation of the Scrapy framework fails with an error about the Python.h header file.

sudo apt-get install python-dev

Fig. 2: Python development libraries
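To confirm that the header is in place before building Scrapy, a quick check like the one below may help. It is only a minimal sketch: it assumes that Python.h lives in the include directory reported by the standard sysconfig module, and the function name python_header_present is made up for illustration.

```python
import os
import sysconfig


def python_header_present():
    """Return True if Python.h exists in this interpreter's include directory."""
    # sysconfig reports where the C headers for the running interpreter live.
    include_dir = sysconfig.get_paths()["include"]
    return os.path.isfile(os.path.join(include_dir, "Python.h"))


if __name__ == "__main__":
    if python_header_present():
        print("Python.h found")
    else:
        print("Python.h missing: install python-dev first")
```

Run it with the same interpreter that pip uses, since each Python installation has its own include directory.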

The Scrapy framework can be installed either from a deb package or from source code. Here, however, we installed it with pip (the Python package manager), as shown in Figure 3.

sudo pip install scrapy

Fig. 3: Scrapy installation

Installing Scrapy takes some time; the successful installation is shown in Figure 4.

Fig. 4: Successful installation of the Scrapy framework
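Besides reading the pip output, the installation can be checked from Python itself. The sketch below is one possible way to do that, not part of the original walkthrough; the helper name installed_version is hypothetical, and it relies only on the standard importlib module.

```python
import importlib


def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Not every package defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("Scrapy is not importable; check the pip output for errors")
    else:
        print("Scrapy version: " + version)
```

If the script reports that Scrapy is not importable, the pip output usually points to the missing dependency.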

Data extraction using the Scrapy framework

(Basic Tutorial)

We will use Scrapy to extract the names of stores that provide Cards from the fatwallet.com web site. First, we created a new Scrapy project, “store_name”, with the command below, as shown in Figure 5.

$ sudo scrapy startproject store_name

Fig. 5: Creation of a new project in the Scrapy framework

The above command creates a directory named “store_name” in the current path. This main project directory contains the files and folders shown in Figure 6.

$ sudo ls -lR store_name

Fig. 6: Contents of the store_name project

A brief description of each file/folder is given below:

  • scrapy.cfg is the project configuration file.
  • store_name/ is another directory inside the main directory. It contains the Python code of the project.
  • store_name/items.py defines the items that the spider will extract.
  • store_name/pipelines.py is the pipelines file.
  • store_name/settings.py holds the settings of the store_name project.
  • store_name/spiders/ is the directory that contains the spiders used for crawling.

Since we want to extract the store names of the Cards from the fatwallet.com site, we updated the contents of store_name/items.py as shown below.

import scrapy


class StoreNameItem(scrapy.Item):
    name = scrapy.Field()   # name of a Cards store

After this, we have to write a new spider under the store_name/spiders/ directory of the project. A spider is a Python class that consists of the following mandatory attributes:

  1. Name of the spider (name)
  2. Starting URLs for crawling (start_urls)
  3. A parse method, which here uses regular expressions to extract the desired items from the page response. The parse method is the most important part of the spider.

We created the spider file “store_name.py” under the store_name/spiders/ directory and added the following Python code to extract store names from the fatwallet.com site. The spider writes its output to the file StoreName.txt, which is shown in Figure 7.

# Targets Python 2, the default interpreter on Ubuntu 14.04.
import re

import scrapy
from scrapy.selector import Selector


class StoreNameSpider(scrapy.Spider):
    name = "storename"
    allowed_domains = ["fatwallet.com"]
    start_urls = ["http://fatwallet.com/cash-back-shopping/"]

    def parse(self, response):
        resp = Selector(response)
        # Select every store row, whatever its even/last class variant.
        tags = resp.xpath(
            '//tr[@class="storeListRow"]|'
            '//tr[@class="storeListRow even"]|'
            '//tr[@class="storeListRow even last"]|'
            '//tr[@class="storeListRow last"]'
        ).extract()
        with open('StoreName.txt', 'w') as output:
            for i in tags:
                i = i.encode('utf-8', 'ignore').strip()
                store_name = ''
                match = re.search(r'class="storeListStoreName">.*?<', i, re.I | re.S)
                if match:
                    # Keep only the text between '>' and the next '<'.
                    store_name = re.search(r'>.*?<', match.group(), re.I | re.S).group()
                    store_name = store_name.replace('>', '').replace('<', '')
                    # Decode the HTML entity for the ampersand.
                    store_name = store_name.replace('&amp;', '&')
                output.write(store_name + "\n")

Fig. 7: Output of the spider code
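The regular-expression step inside the parse method can be tried on its own, without running a crawl. The sketch below is only an illustration: SAMPLE_ROW is a hypothetical table row shaped like the markup the spider's patterns expect, not data taken from fatwallet.com, and the helper extract_store_name is a made-up name. It uses a single capture group in place of the spider's repeated searches, which should give the same text for rows of this shape.

```python
import re

# Hypothetical row, shaped like the markup the spider's patterns expect.
SAMPLE_ROW = (
    '<tr class="storeListRow">'
    '<td class="storeListStoreName">Barnes &amp; Noble</td>'
    '</tr>'
)


def extract_store_name(row_html):
    """Return the store name found in one table row, or '' if none is present."""
    match = re.search(r'class="storeListStoreName">(.*?)<', row_html, re.I | re.S)
    if match is None:
        return ''
    # The capture group holds the text between '>' and the next '<';
    # the HTML entity for the ampersand is decoded afterwards.
    return match.group(1).replace('&amp;', '&')


print(extract_store_name(SAMPLE_ROW))  # prints: Barnes & Noble
```

Testing the pattern on a saved row like this makes it easier to adjust the expression when the page markup changes.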
