I have more than 5000 webpages, and I want the titles of all of them. In my project I am using the BeautifulSoup HTML parser like this:
soup = BeautifulSoup(open(url).read())
soup('title')[0].string
But it's taking a lot of time. Just to get the title of a webpage, I am reading the entire file and building the parse tree (I think this is the reason for the delay; correct me if I am wrong).
Is there any other simple way to do this in Python?
Solution
It would certainly be faster if you just used a simple regular expression; BeautifulSoup is pretty slow. You could do something like:
import re
regex = re.compile('<title>(.*?)</title>', re.IGNORECASE|re.DOTALL)
regex.search(string_to_search).group(1)
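For many pages, the regex approach above can be wrapped in a small helper. The following is only a sketch under some assumptions that are not in the original post: the pages are local files, they decode acceptably as UTF-8, and the title appears within the first 8 KB. The names `extract_title`, `title_from_file`, and `max_bytes` are hypothetical, and the pattern is a slight variation that also tolerates attributes on the opening tag.

```python
import re

# Compiled once and reused for every page. DOTALL lets the title span
# several lines; IGNORECASE matches <TITLE> as well as <title>.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if none is found."""
    match = TITLE_RE.search(html)
    if match is None:
        # Avoids the AttributeError that .group(1) would raise on no match.
        return None
    # Collapse newlines and repeated spaces inside the title.
    return ' '.join(match.group(1).split())


def title_from_file(path, max_bytes=8192):
    """Read only the first max_bytes of a file (assumed limit) and extract the title."""
    with open(path, 'rb') as f:
        head = f.read(max_bytes)
    # Assumption: UTF-8 is close enough; undecodable bytes are replaced.
    return extract_title(head.decode('utf-8', errors='replace'))
```

Reading only the start of each file avoids loading whole pages when the title sits in the head, which it usually does. If a page has no title in that window, the helper returns None, and the caller could fall back to reading the full file or to BeautifulSoup.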