For a course project, the instructor asked me to help their group run a crawler. I downloaded the source and ran it under Anaconda, and got a strange error.
Traceback (most recent call last):
File "ccf_crawler.py", line 118, in <module>
save_dblp_papers()
File "ccf_crawler.py", line 102, in save_dblp_papers
watcher_process.start()
File "E:\jupter\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "E:\jupter\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "E:\jupter\lib\multiprocessing\context.py", line 326, in _Popen
return Popen(process_obj)
File "E:\jupter\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "E:\jupter\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'paper_crawler.crawler_manager.CrawlerManager'>: it's not the same object as paper_crawler.crawler_manager.CrawlerManager
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "E:\jupter\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "E:\jupter\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
This is a Python pickle error. pickle is Python's built-in serialization module: it turns objects into a byte stream that can be written to a file and loaded back later. So why does the script fail only on Windows? After some digging, the answer turns out to involve how pickle interacts with multiprocessing.
The crawler is written with processes. On Windows there is no fork(), so multiprocessing starts child processes with the spawn method: the Process object and everything it references are pickled and sent to the child. If anything attached to the process is unpicklable (open sockets, file handles), or if a class's identity no longer matches its import path (the "not the same object" message above), start() fails. On Linux the default start method is fork: the child inherits the parent's memory directly, so nothing needs to be pickled at all.
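The "not the same object" message can be reproduced with pickle alone, without multiprocessing at all. A minimal sketch (the CrawlerManager below is a local stand-in, not the crawler's real paper_crawler.crawler_manager.CrawlerManager):

```python
import pickle

class CrawlerManager:            # local stand-in for the crawler's real class
    pass

obj = CrawlerManager()
pickle.dumps(obj)                # fine: pickle records the class's import path

# Rebinding the module-level name to a *different* class object reproduces
# the error. The same thing happens when a module ends up imported twice
# under two paths and old instances still point at the first copy.
class CrawlerManager:            # redefined
    pass

try:
    pickle.dumps(obj)            # obj's class no longer matches the lookup
    err_msg = ""
except pickle.PicklingError as exc:
    err_msg = str(exc)

print(err_msg)                   # mentions "it's not the same object as ..."
```

This is why the error names paper_crawler.crawler_manager.CrawlerManager twice: pickle found a class at that import path, but it was not the same class object the instance had been created from.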
So there are two possible fixes:
- Refactor the code to use threads instead of processes. A crawler is I/O-bound, so this has little performance impact, and the script will then run on Windows.
- Simply run it on Linux, where fork sidesteps the pickling step entirely.
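The first option can be sketched with the standard library; the watcher function and the URLs below are hypothetical stand-ins for the crawler's actual watcher logic:

```python
import threading
import queue

def watcher(tasks: queue.Queue, results: list) -> None:
    # Stand-in for the crawler's watcher process. Threads share memory,
    # so nothing has to be pickled, and the code runs on Windows as-is.
    while True:
        url = tasks.get()
        if url is None:                      # sentinel: stop the worker
            break
        results.append(f"fetched {url}")     # real code would download here

tasks: queue.Queue = queue.Queue()
results: list = []

# threading.Thread has the same constructor shape as multiprocessing.Process,
# so the refactor is largely a drop-in replacement for watcher_process.start().
watcher_thread = threading.Thread(target=watcher, args=(tasks, results))
watcher_thread.start()

for url in ["https://dblp.org/a", "https://dblp.org/b"]:
    tasks.put(url)
tasks.put(None)                              # tell the worker to exit
watcher_thread.join()
print(results)
```

Because crawling is dominated by network I/O, the GIL is not a real bottleneck here, which is why switching from processes to threads costs little performance.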
So: open an Ubuntu VM, set up the environment, install MongoDB, run the code. Success!
...until the VM ran out of space, sigh.
Aside: a MongoDB startup error on Ubuntu 14.04
I installed MongoDB following an online tutorial, and then hit this error:
mongod --dbpath data/db
Mon Dec 7 19:03:14.143 [initandlisten] MongoDB starting : pid=129519 port=27017 dbpath=data/db 64-bit host=ubuntu
Mon Dec 7 19:03:14.143 [initandlisten] db version v2.4.9
Mon Dec 7 19:03:14.143 [initandlisten] git version: nogitversion
Mon Dec 7 19:03:14.143 [initandlisten] build info: Linux orlo 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 BOOST_LIB_VERSION=1_54
Mon Dec 7 19:03:14.143 [initandlisten] allocator: tcmalloc
Mon Dec 7 19:03:14.143 [initandlisten] options: { dbpath: "data/db" }
Mon Dec 7 19:03:14.149 [initandlisten] journal dir=data/db/journal
Mon Dec 7 19:03:14.149 [initandlisten] recover : no journal files present, no recovery needed
Mon Dec 7 19:03:14.149 [initandlisten]
Mon Dec 7 19:03:14.149 [initandlisten] ERROR: Insufficient free space for journal files
Mon Dec 7 19:03:14.149 [initandlisten] Please make at least 3379MB available in data/db/journal or use --smallfiles
Mon Dec 7 19:03:14.149 [initandlisten]
Mon Dec 7 19:03:14.150 [initandlisten] exception in initAndListen: 15926 Insufficient free space for journals, terminating
Mon Dec 7 19:03:14.150 dbexit:
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: going to close listening sockets...
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: going to flush diaglog...
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: going to close sockets...
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: waiting for fs preallocator...
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: lock for final commit...
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: final commit...
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: closing all files...
Mon Dec 7 19:03:14.150 [initandlisten] closeAllFiles() finished
Mon Dec 7 19:03:14.150 [initandlisten] journalCleanup...
Mon Dec 7 19:03:14.150 [initandlisten] removeJournalFiles
Mon Dec 7 19:03:14.150 [initandlisten] shutdown: removing fs lock...
Mon Dec 7 19:03:14.150 dbexit: really exiting now
So it's insufficient free disk space, orz. The VM's disk is just too small; the only workaround is to run with the --smallfiles flag.
Mind the spaces and similar formatting details (I've hit so many whitespace landmines these past few days, exhausting). Details matter!
Run the command:
mongod --dbpath data/db --smallfiles
Success!
Mon Dec 7 19:03:59.144 [initandlisten] MongoDB starting : pid=129886 port=27017 dbpath=data/db 64-bit host=ubuntu
Mon Dec 7 19:03:59.145 [initandlisten] db version v2.4.9
Mon Dec 7 19:03:59.145 [initandlisten] git version: nogitversion
Mon Dec 7 19:03:59.145 [initandlisten] build info: Linux orlo 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 BOOST_LIB_VERSION=1_54
Mon Dec 7 19:03:59.145 [initandlisten] allocator: tcmalloc
Mon Dec 7 19:03:59.145 [initandlisten] options: { dbpath: "data/db", smallfiles: true }
Mon Dec 7 19:03:59.148 [initandlisten] journal dir=data/db/journal
Mon Dec 7 19:03:59.148 [initandlisten] recover : no journal files present, no recovery needed
Mon Dec 7 19:03:59.215 [FileAllocator] allocating new datafile data/db/local.ns, filling with zeroes...
Mon Dec 7 19:03:59.215 [FileAllocator] creating directory data/db/_tmp
Mon Dec 7 19:03:59.217 [FileAllocator] done allocating datafile data/db/local.ns, size: 16MB, took 0 secs
Mon Dec 7 19:03:59.218 [FileAllocator] allocating new datafile data/db/local.0, filling with zeroes...
Mon Dec 7 19:03:59.219 [FileAllocator] done allocating datafile data/db/local.0, size: 16MB, took 0 secs
Mon Dec 7 19:03:59.223 [initandlisten] waiting for connections on port 27017
Mon Dec 7 19:03:59.225 [websvr] admin web console waiting for connections on port 28017
Mon Dec 7 19:07:59.266 [PeriodicTask::Runner] task: DBConnectionPool-cleaner took: 8ms
Mon Dec 7 19:48:17.641 [PeriodicTask::Runner] task: WriteBackManager::cleaner took: 5ms
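In hindsight, a quick free-space check before launching mongod would have flagged the problem immediately. A stdlib sketch (the path and the 3379 MB threshold are taken from the log above; adjust both for your setup):

```python
import shutil

def journal_space_ok(dbpath: str, required_mb: int = 3379) -> bool:
    # 3379 MB is the figure mongod reported above; --smallfiles lowers it.
    free_mb = shutil.disk_usage(dbpath).free // (1024 * 1024)
    return free_mb >= required_mb

# e.g. check the data directory before running `mongod --dbpath data/db`
print(journal_space_ok("."))
```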