1. XPath Issue
The XPath selector in Scrapy appears to apply a different rule for the "text()" selector than the default lxml library.
In Scrapy, the text() selector returns all the text strings wrapped by the given node, both directly and indirectly. By default, the "text()" selector in lxml behaves differently: it returns only the text strings directly wrapped by the upper-level selector.
2. In-site search engine for a large B2C website
The in-site search engine first runs PoS segmentation on the user input (such as "雅诗兰黛肌初赋活原生液双件套"), then matches the resulting PoS collection (word vector) against the PoS collections (word vectors) of the product titles in the product database and picks the titles with the largest overlap.
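The matching step can be sketched as below. This is only an illustration, not the search engine's actual implementation: the segmented tokens, the catalog entries and the helper names (overlap_score, rank_titles) are all hypothetical, and the segmentation is assumed to have been done already by a separate segmenter.

```python
def overlap_score(query_tokens, title_tokens):
    """Count how many distinct query tokens also appear in the title."""
    return len(set(query_tokens) & set(title_tokens))


def rank_titles(query_tokens, titles):
    """Return (title_id, score) pairs sorted by descending overlap.

    `titles` maps a hypothetical product id to its pre-segmented title.
    """
    scored = [(title_id, overlap_score(query_tokens, tokens))
              for title_id, tokens in titles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, already-segmented data (not from a real product database).
query = ["雅诗兰黛", "肌初赋活", "原生液", "双件套"]
catalog = {
    "sku-1": ["雅诗兰黛", "肌初赋活", "原生液", "200ml"],
    "sku-2": ["雅诗兰黛", "眼霜"],
    "sku-3": ["面膜"],
}
ranking = rank_titles(query, catalog)
print(ranking[0])  # the title sharing the most tokens with the query
```

A plain overlap count favours long titles; dividing by the size of the union (Jaccard similarity) is one common way to compensate.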
3. Scrapy Framework Issue
Each spider can have its own item pipeline that is called after a response has been received, but by default all spiders share one common items definition.
4. Relative File Path
.. = parent path
../.. = parent of the parent path
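A small check of how these two forms resolve. The base directory is a made-up example, and posixpath is used so the result is the same on every platform.

```python
import posixpath

base = "/home/user/project/src"  # hypothetical directory

# Join the relative part onto the base, then collapse the ".." segments.
parent = posixpath.normpath(posixpath.join(base, ".."))
grandparent = posixpath.normpath(posixpath.join(base, "../.."))

print(parent)       # /home/user/project
print(grandparent)  # /home/user
```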
5. Iterate Dictionary Object
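# Iterating a dict directly yields only its keys; use items() for key/value pairs.
for i, j in Dict.items():
    print(i)  # key
    print(j)  # value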
6. Python Relative Import
From the Python tutorial (Intra-package References):
Note that relative imports are based on the name of the current module. Since the name of the main module is always "__main__", modules intended for use as the main module of a Python application must always use absolute imports.
When testing a Python module, it is common practice to execute the module directly from the IDE with the module's parent directory as the working directory. In most cases this works without problems; however, when the executed module contains a relative import statement, an exception is raised.
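The behaviour can be reproduced with a throwaway package. This is a sketch for Python 3.7+ (it relies on subprocess.run with capture_output); the package and module names (pkg, helper, main) are made up for the demonstration.

```python
import os
import subprocess
import sys
import tempfile


def run_demo():
    """Run the same module as a script and with -m, and return both results."""
    with tempfile.TemporaryDirectory() as root:
        pkg = os.path.join(root, "pkg")
        os.mkdir(pkg)
        open(os.path.join(pkg, "__init__.py"), "w").close()
        with open(os.path.join(pkg, "helper.py"), "w") as f:
            f.write("VALUE = 42\n")
        with open(os.path.join(pkg, "main.py"), "w") as f:
            f.write("from . import helper\nprint(helper.VALUE)\n")

        # Executed directly: the module is named "__main__", so the
        # relative import cannot be resolved and the run fails.
        direct = subprocess.run(
            [sys.executable, os.path.join(pkg, "main.py")],
            capture_output=True, text=True)

        # Executed as part of its package from the parent directory:
        # the relative import resolves normally.
        as_module = subprocess.run(
            [sys.executable, "-m", "pkg.main"],
            cwd=root, capture_output=True, text=True)
        return direct, as_module


direct, as_module = run_demo()
print("direct run failed:", direct.returncode != 0)
print("module run output:", as_module.stdout.strip())
```

Running with -m from the package's parent directory (or switching to absolute imports) is the usual way around the problem.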
7. Python Exception Warning:
Message raised by CPython:
exceptions.KeyError: 'XXXX'
Reason: The given key cannot be found in the dictionary.
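Two common ways to deal with a missing key, as a minimal sketch. The dictionary contents and the fallback value are made up for the example.

```python
prices = {"sku-1": 630.0}  # hypothetical data

# Option 1: catch the exception; the missing key is its first argument.
try:
    prices["sku-2"]
except KeyError as exc:
    missing = exc.args[0]
    print("missing key:", missing)

# Option 2: ask for a default instead of raising.
fallback = prices.get("sku-2", -1.0)
print(fallback)
```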
8. Python PyCharm Debug:
Left-click in the left margin of the editor to set a breakpoint, then right-click the breakpoint to define the break condition (such as a == 100). Start the debug session by clicking the Debug button in the Run menu.
9. In one Scrapy project, only one pipeline stream is in effect, although multiple spiders may be working concurrently through the Twisted engine.
In order to run multiple spiders concurrently in one process, multiple crawlers (objects inheriting from the Crawler class in the Scrapy core API) should be instantiated. User-defined spiders are then loaded into these crawlers individually. The crawlers can then be started by calling crawler.start().
10. Current Working Directory
The current working directory (cwd) determines how a relative path is resolved whenever a program receives a relative path as a parameter. The absolute path is built from the cwd and the given relative path, and that absolute path is then used to execute the command being called.
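A short check that the same relative path resolves against whatever the cwd currently is. The relative path data/output.txt is a made-up example, and the sketch assumes Python 3 (tempfile.TemporaryDirectory).

```python
import os
import tempfile

original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as workdir:
    os.chdir(workdir)  # change the cwd for the duration of the check
    try:
        # abspath() joins the relative path onto the current cwd.
        resolved = os.path.abspath("data/output.txt")
        expected = os.path.join(os.getcwd(), "data", "output.txt")
    finally:
        os.chdir(original_cwd)  # always restore the original cwd

print(resolved == expected)
```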
11. When debugging with pydev inside the PyCharm IDE, .pyc files that are not imported are also loaded and executed, which causes extra bugs. The reason behind this remains unknown.
12. HTTP Request
The request url and response are essentially the same.
URLs can only be sent over the Internet using the ASCII character set; therefore, all other characters must be converted to a valid ASCII format.
URL encoding replaces unsafe characters with a "%" followed by two hexadecimal digits.
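A quick illustration with the standard library. Note that this uses the Python 3 location of the functions (urllib.parse); the sample strings are made up.

```python
from urllib.parse import quote, unquote

# Unsafe ASCII characters become "%" plus two hexadecimal digits.
encoded_ascii = quote("a b&c")
print(encoded_ascii)  # a%20b%26c

# Non-ASCII text is first encoded as UTF-8, then each byte is escaped.
raw = "雅诗兰黛 a&b"
encoded = quote(raw)
print(encoded)

# Decoding restores the original string.
decoded = unquote(encoded)
print(decoded == raw)
```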
13. In Python 2.7, the default read method of the file object returns a str; in contrast, the read method of the object returned by codecs.open returns a unicode object.
14. With a MySQL database, a cursor offers three different fetch methods for reading the result of an SQL query:
fetchone, fetchall, fetchmany
Fetching 870 k rows took 17 seconds for the query and the data transmission from localhost to the local program.
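The difference between the three calls can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module, which exposes the same DB-API cursor methods; the table layout is a simplified, hypothetical one, and MySQL drivers typically use %s instead of ? as the parameter placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE skuprice (id INTEGER, price REAL)")
cur.executemany("INSERT INTO skuprice VALUES (?, ?)",
                [(i, i * 1.5) for i in range(10)])

cur.execute("SELECT id, price FROM skuprice ORDER BY id")
first = cur.fetchone()    # a single row, or None when exhausted
batch = cur.fetchmany(3)  # a list with up to three further rows
rest = cur.fetchall()     # a list with every remaining row
conn.close()

print(first, len(batch), len(rest))
```

For very large result sets, looping over fetchmany keeps memory bounded, whereas fetchall loads every row at once.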
15. In PyCharm, a breakpoint placed on a pass statement has no effect on the program.
16. In lxml, an XPath that selects an element by comparing its class attribute must give all the class names exactly as they appear in the attribute; giving only one of the class names will not find the desired element, because @class="..." compares the whole attribute string.
Example:
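from io import StringIO
from lxml import etree

s = u'<html><div class="proTit bbp">aaa<span>TTT<em>GGG</em></span>abc</div></html>'
f = StringIO(s)
doc = etree.parse(f)
# The full attribute value "proTit bbp" is required for an exact match.
a = doc.xpath(r'//div[@class="proTit bbp"]')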
17. In Python 2.7, the default character encoding is ASCII, which means that if you convert a str object to a unicode object without indicating an encoding, it is decoded as ASCII.
For the built-in function unicode(object[, encoding[, errors]]):
The encoding parameter tells the interpreter which encoding the object to be converted is stored in. If no encoding is given, the interpreter assumes that the given object is encoded in ASCII and raises UnicodeDecodeError on any non-ASCII byte.
18. lxml, output encoding
In Python 2.7, an lxml XPath text selector returns a str when the text contains no non-ASCII characters. When the text contains a non-ASCII character, the selector returns a unicode object.
19. phpMyAdmin bug
A query statement that contains a value of the wrong data type can still be processed. (Possibly phpMyAdmin converts the type to correct it; MySQL itself also coerces invalid values this way when strict SQL mode is off.)
Example:
INSERT INTO skuprice (vendor_url, source_name, price, scrapping_time, sales_volume, comments, sku_name, B2C_platform) VALUES ('http://item.jd.com/1540589907.html', '兰嘉丝汀(LANCASTER)365美肌再现密集修护精华露', '630.0', '1434001938.15', '-999', '0', '兰嘉丝汀365美肌再现密集修护精华露(365精华)', '京东')
As a result, a value of 0 was inserted into the B2C_platform column, which is an int column.
20. In Scrapy (Python 2.7)
Inside one crawler (bot), multiple spiders share a common pipeline. Therefore, the __init__ method of the pipeline is called exactly once.
21. Database Insert
When using the InnoDB storage engine in MySQL with a long-lived connection, InnoDB threw a wait-timeout error. After switching to MyISAM, everything worked fine. The reason remains unknown.
22. Scrapy Debug
To debug a Scrapy spider, one can use scrapy shell, possibly with IPython, to test XPath expressions.
23. MySQL Database Setup
When every option is open, it is better to choose settings you already understand, even if doing so means changing the default settings.
24. Python shutil module
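The copy method copies the file and overwrites an existing destination file; the destination may also be a directory. The copyfile method also overwrites an existing destination file, but it requires the destination to be a complete file name and raises an exception when the destination is a directory or when source and destination are the same file.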
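The behaviour of the two functions can be checked in a temporary directory. This is a sketch for Python 3 (SameFileError exists since 3.4); the file names are made up, and the exact exception type for a directory destination depends on the operating system, so only the OSError base class is caught.

```python
import os
import shutil
import tempfile


def read(path):
    with open(path) as f:
        return f.read()


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src.txt")
    dst = os.path.join(root, "dst.txt")
    target_dir = os.path.join(root, "out")
    os.mkdir(target_dir)
    with open(src, "w") as f:
        f.write("new")
    with open(dst, "w") as f:
        f.write("old")

    shutil.copyfile(src, dst)      # an existing destination file is overwritten
    overwritten = read(dst)

    shutil.copy(src, target_dir)   # copy() accepts a directory as destination
    copied_into_dir = os.path.exists(os.path.join(target_dir, "src.txt"))

    try:
        shutil.copyfile(src, target_dir)  # copyfile() needs a file name
        copyfile_dir_error = None
    except OSError as exc:
        copyfile_dir_error = type(exc).__name__

    try:
        shutil.copyfile(src, src)         # same source and destination
        same_file_error = None
    except shutil.Error as exc:           # SameFileError subclasses shutil.Error
        same_file_error = type(exc).__name__

print(overwritten, copied_into_dir, copyfile_dir_error, same_file_error)
```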
25. Create the web server and web bot processes from within the launcher program.
The system() function executes a system command.
command format: "start" command, which runs a program in a separate console window.
Note:
1. start: Lets a user open a separate window in Windows from the Windows command line; the stdout and stderr of each program go directly to its own separate window. The current command window is closed after the program-specific cmd windows are opened. The individual programs keep running until their own cmd window is shut down.
2. start /b: Adding the parameter /b redirects the stdout and stderr of the program started by this command to the current command window. Note that the programs are still launched concurrently (when the command is given through the system() function instead of typed manually into cmd), but these concurrently launched programs are terminated when the current command window is shut down.
command format: "taskkill /F /T /IM MyProcess.exe" (forcefully terminates MyProcess.exe together with its child processes)
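A launcher along these lines might look like the sketch below. It is only an outline: the helper names and the program names (webserver.exe, webbot.exe) are hypothetical, and the commands are Windows-only, so launch() is defined but not called here.

```python
import os


def start_command(program, share_console=False):
    """Build a Windows `start` command line for `program`."""
    # The empty "" is the window title; without it, `start` treats the
    # first quoted argument as the title instead of the program.
    flag = "/b " if share_console else ""
    return 'start "" {}"{}"'.format(flag, program)


def taskkill_command(image_name):
    """Build the command that force-kills a process tree by image name."""
    return "taskkill /F /T /IM {}".format(image_name)


def launch(commands):
    """Run each command through the shell (meaningful on Windows only)."""
    for command in commands:
        os.system(command)


# Hypothetical programs: one in its own window, one sharing this console.
commands = [
    start_command("webserver.exe"),
    start_command("webbot.exe", share_console=True),
]
print(commands)
print(taskkill_command("MyProcess.exe"))
# On Windows: launch(commands)
```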