Scrapy Spider Project Technical Notes

1. XPath Issue

The XPath selector in Scrapy implements a different rule for the "text()" selector than the default lxml library does.

In Scrapy, the "text()" selector returns all the text strings wrapped by the given node, both directly and indirectly. By default, the "text()" selector in lxml behaves differently: it returns only the text strings directly wrapped by the node chosen by the preceding selector step.


2. In-site search engine for large B2C website

An in-site search engine first performs word segmentation with part-of-speech (PoS) tagging on the user input (such as "雅诗兰黛肌初赋活原生液双件套"), then matches the resulting PoS collection (word vector) against the PoS collections (word vectors) of the product titles in the product database and picks the titles with the greatest overlap.
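The matching step can be sketched as a set-overlap score. This is only an illustration: the segmentation shown is a hypothetical hand-made split of the example query, the catalog and its keys (sku-1 and so on) are invented, and a real search engine would use a proper segmenter and a more refined ranking.

```python
def overlap_score(query_tokens, title_tokens):
    """Count the tokens shared by the query and a product title."""
    return len(set(query_tokens) & set(title_tokens))


def best_match(query_tokens, catalog):
    """Return the catalog key whose title tokens overlap the query the most."""
    return max(catalog, key=lambda key: overlap_score(query_tokens, catalog[key]))


# Hypothetical, hand-segmented tokens (not produced by a real segmenter).
query = ["雅诗兰黛", "肌初赋活", "原生液", "双件套"]
catalog = {
    "sku-1": ["雅诗兰黛", "肌初赋活", "原生液", "30ml"],
    "sku-2": ["兰蔻", "小黑瓶", "精华"],
    "sku-3": ["雅诗兰黛", "眼霜"],
}

best = best_match(query, catalog)
print(best)
```

Using sets ignores word order and repeated words, which matches the "collection" idea in the note.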


3. Scrapy Framework Issue

Each spider can have its own item pipeline, which is called after a response has been received, but by default all spiders share one common items definition.


4. Relative File Path

.. = parent path

../.. = grandparent path (the parent of the parent)
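A small sketch of how these segments resolve. The base directory /project/spiders/jd is a made-up example, and posixpath is used so the result is the same on every operating system.

```python
import posixpath

base = "/project/spiders/jd"  # hypothetical directory, for illustration only

# ".." steps up one level, "../.." steps up two levels.
parent = posixpath.normpath(posixpath.join(base, ".."))
grandparent = posixpath.normpath(posixpath.join(base, "../.."))

print(parent)
print(grandparent)
```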


5. Iterate Dictionary Object

# .items() yields (key, value) pairs; iterating the dict itself yields only the keys

for i, j in Dict.items():

    print(i)

    print(j)


6. Python Relative Import

From the Python documentation:

Note that relative imports are based on the name of the current module. Since the name of the main module is always "__main__", modules intended for use as the main module of a Python application must always use absolute imports.


When testing a Python module, it is common practice to execute the module directly from the IDE with the module's parent directory as the working directory. In most cases this causes no problem; however, when the executed module contains a relative import statement, an exception is raised.
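The behaviour described above can be reproduced with a throwaway package. This is a sketch for Python 3.7 or later; the package name demo_pkg and its two modules are invented for the demonstration. Running the module with the -m switch from the package's parent directory is one common way to keep the relative import working.

```python
import os
import subprocess
import sys
import tempfile


def run_relative_import_demo():
    """Run one module both as a script and with -m, and return both results."""
    with tempfile.TemporaryDirectory() as root:
        pkg = os.path.join(root, "demo_pkg")  # hypothetical package name
        os.makedirs(pkg)
        open(os.path.join(pkg, "__init__.py"), "w").close()
        with open(os.path.join(pkg, "helper.py"), "w") as f:
            f.write("VALUE = 42\n")
        with open(os.path.join(pkg, "main.py"), "w") as f:
            f.write("from .helper import VALUE\nprint(VALUE)\n")

        # Executed directly: the module is named "__main__" and has no parent
        # package, so the relative import fails.
        as_script = subprocess.run(
            [sys.executable, os.path.join(pkg, "main.py")],
            capture_output=True, text=True,
        )
        # Executed with -m from the parent directory: the package is known,
        # so the relative import succeeds.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.main"],
            cwd=root, capture_output=True, text=True,
        )
        return as_script, as_module


as_script, as_module = run_relative_import_demo()
print(as_script.returncode, as_module.stdout.strip())
```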


7. Python Exception Warning

Message raised by CPython:

exceptions.KeyError: 'XXXX'


Reason: The given key cannot be found in the dictionary.
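A minimal sketch of two common ways to deal with a missing key. The dictionary and its keys are made up for the example.

```python
prices = {"sku_a": 630.0}  # hypothetical data

# Option 1: catch the KeyError raised by the subscript lookup.
try:
    value = prices["sku_b"]
except KeyError as exc:
    missing_key = exc.args[0]  # the key that could not be found
    value = None

# Option 2: dict.get() returns a default instead of raising.
fallback = prices.get("sku_b", -999)

print(missing_key, value, fallback)
```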


8. Python PyCharm Debug:

Left-click on the left edge of the editor to set a breakpoint, then right-click the breakpoint to define the break condition (such as a == 100). Start the debug process by clicking the Debug button in the Run menu.


9. In one Scrapy project, only one pipeline stream is in effect, although multiple spiders may be working concurrently through the Twisted engine.

To run multiple spiders concurrently in one process, multiple crawlers (objects derived from the Crawler class in the Scrapy core API) should be instantiated. The user-defined spiders are then loaded into these crawlers individually. The crawlers can then be started by calling crawler.start().


10. Current Working Directory

The current working directory (cwd) determines how a relative path behaves when a program receives a relative path as a parameter. The operating system builds the absolute path from the cwd and the given relative path, and then uses that absolute path to execute the command being called.
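A short sketch of the effect: the same relative path resolves to two different absolute paths depending on the cwd. The relative path data/output.csv is a made-up example and the file does not need to exist.

```python
import os
import tempfile


def resolve(relative_path):
    """Build the absolute path from the current working directory."""
    return os.path.abspath(relative_path)


original_cwd = os.getcwd()
relative = os.path.join("data", "output.csv")  # hypothetical relative path

with tempfile.TemporaryDirectory() as workdir:
    os.chdir(workdir)
    try:
        resolved_in_workdir = resolve(relative)
    finally:
        # Always restore the cwd, also so the temp directory can be removed.
        os.chdir(original_cwd)

resolved_in_original = resolve(relative)

print(resolved_in_workdir)
print(resolved_in_original)
```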


11. When debugging with pydev inside the PyCharm IDE, .pyc files that are not imported are also loaded and executed, which causes extra bugs. The reason behind this remains unknown.


12. HTTP Request

The request url and response are essentially the same.

URLs can only be sent over the Internet using the ASCII character set; therefore, all other characters must be converted to a valid ASCII format.

URL encoding replaces each unsafe character with one or more "%XX" sequences, that is, a "%" followed by two hexadecimal digits.
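A brief illustration with the Python 3 standard library (in Python 2.7 the same functions live in the urllib module). The input strings are arbitrary examples.

```python
from urllib.parse import quote, unquote

# A space is not allowed in a URL, so it becomes one %XX sequence.
encoded_space = quote("a b")

# A non-ASCII character is first encoded as UTF-8 bytes,
# then every byte becomes its own %XX sequence.
encoded_cjk = quote("京东")

# unquote() reverses the encoding.
decoded = unquote(encoded_cjk)

print(encoded_space, encoded_cjk, decoded)
```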


13. In Python 2.7, the default read method of the file object returns a str object; in contrast, the read method of the object returned by codecs.open returns a unicode object.
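The same distinction can be sketched in Python 3, where the two types are called bytes and str instead of str and unicode. Reading in binary mode stands in for the Python 2.7 default read here, and the file name and contents are made up.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file
    with open(path, "wb") as f:
        f.write("京东".encode("utf-8"))

    # Raw read: undecoded bytes (the Python 2.7 str).
    with open(path, "rb") as f:
        raw = f.read()

    # codecs.open decodes while reading (the Python 2.7 unicode).
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(type(raw).__name__, type(text).__name__)
```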


14. In MySQL, a cursor offers three different fetch methods for reading the results of an SQL query:

fetchone, fetchall, fetchmany


Fetching 870k rows of records took 17 seconds for the query and the data transmission from localhost to the local program.
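The three fetch methods can be sketched as follows. To keep the example self-contained it uses the sqlite3 module from the standard library instead of MySQL; both drivers follow the Python DB-API, so the cursor methods have the same names. The table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL db
cur = conn.cursor()
cur.execute("CREATE TABLE skuprice (sku_name TEXT, price REAL)")
cur.executemany(
    "INSERT INTO skuprice VALUES (?, ?)",
    [("item-%d" % i, float(i)) for i in range(5)],  # hypothetical rows
)

cur.execute("SELECT sku_name, price FROM skuprice ORDER BY price")
first_row = cur.fetchone()    # a single row
next_rows = cur.fetchmany(2)  # the next two rows
remaining = cur.fetchall()    # every row that is left

conn.close()
print(first_row, next_rows, remaining)
```

For very large result sets, fetchmany in a loop keeps memory use bounded, while fetchall loads everything at once.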


15. In PyCharm, a breakpoint placed on a pass statement has no effect on the program.


16. In lxml, an XPath expression that selects a DOM element by its class attribute should give all the class names, because @class="..." compares the whole attribute value; giving only one class name will not find the desired element.

Example:

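from io import StringIO

from lxml import etree

s = u'<html><div class="proTit bbp">aaa<span>TTT<em>GGG</em></span>abc</div></html>'

f = StringIO(s)

doc = etree.parse(f)

# The predicate contains the full class attribute value, "proTit bbp".
a = doc.xpath(r'//div[@class="proTit bbp"]')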


17. In Python 2.7, the default character encoding is ASCII, which means that if you convert a str object to a unicode object without specifying an encoding, it is decoded as ASCII.

For the built-in function unicode(object[, encoding[, errors]]):

The encoding parameter tells the interpreter which encoding the object to be converted is stored in. If no encoding is given, the interpreter assumes that the given object is encoded in ASCII.
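A Python 3 sketch of the same rule. In Python 2.7, unicode(raw) without an encoding behaves like raw.decode("ascii") below. Note that Python 3's own default for decode() is UTF-8, so "ascii" is passed explicitly here to imitate the Python 2.7 behaviour. The sample text is arbitrary.

```python
raw = "京东".encode("utf-8")  # bytes containing non-ASCII data

# Decoding as ASCII fails because the bytes are outside the ASCII range.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real encoding makes the conversion succeed.
text = raw.decode("utf-8")

print(ascii_failed, text)
```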


18. lxml, output encoding

In Python 2.7, an lxml XPath text selector returns a str object when the text contains no non-ASCII characters. When the text contains non-ASCII characters, the selector returns a unicode object.


19. phpMyAdmin Bug

A query statement that contains a value of the wrong data type can still be processed. (Possibly phpMyAdmin converts the value to the correct type.)

Example:

INSERT INTO skuprice (vendor_url, source_name, price, scrapping_time, sales_volume, comments, sku_name, B2C_platform) VALUES ('http://item.jd.com/1540589907.html', '兰嘉丝汀(LANCASTER)365美肌再现密集修护精华露', '630.0', '1434001938.15', '-999', '0', '兰嘉丝汀365美肌再现密集修护精华露(365精华)', '京东')


As a result, a value of 0 was inserted into the B2C_platform column, which is an int column.


20. In scrapy (Python2.7)

Inside one crawler (bot), multiple spiders share a common pipeline. Therefore, the __init__ method of the pipeline is called exactly once.


21. Database Insert

When the InnoDB storage engine in MySQL was used with a long-lived connection, InnoDB raised a wait timeout error; after switching to MyISAM, everything worked fine. The reason remains unknown.


22. Scrapy Debug

To debug a Scrapy spider, one can use the Scrapy shell, possibly with IPython, to test XPath expressions.

23. MySQL Database Setup

When all options are open, it is always better to use settings you already understand, even if this means changing the default settings.


24. Python shutil module

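The copy method copies the file and overwrites the destination file; if the destination is a directory, the file is copied into it. The copyfile method also overwrites an existing destination file, but it requires a full destination file name and raises an exception when the destination is a directory or is the same file as the source.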
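A small Python 3 sketch that exercises both functions inside a temporary directory. All file and directory names are invented for the demonstration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")      # hypothetical source file
    dst = os.path.join(tmp, "dst.txt")      # hypothetical existing target
    dst_dir = os.path.join(tmp, "out")      # hypothetical target directory
    os.mkdir(dst_dir)
    with open(src, "w") as f:
        f.write("new")
    with open(dst, "w") as f:
        f.write("old")

    # copyfile replaces the contents of an existing destination file.
    shutil.copyfile(src, dst)
    with open(dst) as f:
        overwritten = f.read()

    # copy accepts a directory and places the file inside it.
    shutil.copy(src, dst_dir)
    copied_into_dir = os.path.exists(os.path.join(dst_dir, "src.txt"))

    # copyfile does not accept a directory as the destination.
    try:
        shutil.copyfile(src, dst_dir)
        copyfile_to_dir_failed = False
    except OSError:
        copyfile_to_dir_failed = True

print(overwritten, copied_into_dir, copyfile_to_dir_failed)
```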


25. Create webserver and web bot process from within the launcher program.

The system() function (os.system() in Python) executes a system command.

command format: "start" command, which runs a program in a separate console window.

Note:

1. start: Lets a user open a separate window from the Windows command line; the stdout and stderr of each program go directly to its own separate window. The current command window is closed after the program-specific cmd windows are opened. Each program keeps running until its own cmd window is shut down.

2. start /b: Adding the /b parameter redirects the stdout and stderr of the program started by this command to the current command window. Note that the programs are still launched concurrently (when the command is issued through the system() function instead of being typed manually in cmd), but these concurrently launched programs are terminated when the current command window is shut down.


command format: "taskkill /F /T /IM MyProcess.exe"
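A rough sketch of how a launcher might assemble these two commands. The helper names and the program name webserver.exe are hypothetical, and the launch function only makes sense on Windows. In taskkill, /F forces termination, /T also ends child processes, and /IM selects the process by image name.

```python
import os
import sys


def build_start_command(program, share_console=False):
    """Build a Windows `start` command line for `program` (hypothetical helper)."""
    # The empty "" is the window title; without it, `start` would treat the
    # quoted program path as the title.
    parts = ["start", '""']
    if share_console:
        parts.append("/b")  # no new window, output stays in the current one
    parts.append('"{}"'.format(program))
    return " ".join(parts)


def build_kill_command(image_name):
    """Build the taskkill command that ends a process tree by image name."""
    return "taskkill /F /T /IM {}".format(image_name)


def launch(program, share_console=False):
    """Run the start command through os.system(); Windows only."""
    if sys.platform != "win32":
        raise RuntimeError("the start command exists only on Windows")
    return os.system(build_start_command(program, share_console))


print(build_start_command("webserver.exe"))
print(build_start_command("webserver.exe", share_console=True))
print(build_kill_command("MyProcess.exe"))
```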
