The Crawler
urllib2 is a library bundled with Python that makes it easy to download pages—all
you have to do is supply the URL. You’ll use it in this section to download the pages
that will be indexed. To see it in action, start up your Python interpreter and try this:
>> import urllib2
>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>> contents=c.read()
>> print contents[0:50]
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans'
The crawler will use the Beautiful Soup API that was introduced in Chapter 3, an
excellent library that builds a structured representation of web pages.
Using urllib2 and Beautiful Soup you can build a crawler that will take a list of URLs
to index and crawl their links to find other pages to index. First, add these import
statements to the top of searchengine.py:
import urllib2
from BeautifulSoup import *
from urlparse import urljoin
# Create a list of words to ignore
ignorewords=set(['the','of','to','and','a','in','is','it'])
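Using those imports, one possible shape for the crawling loop is sketched below. This is only a sketch: the real crawler would call an indexing method for each page rather than just remembering which URLs it has already visited.

def crawl(pages, depth=2):
    indexed = set()
    for i in range(depth):
        newpages = set()
        for page in pages:
            try:
                c = urllib2.urlopen(page)
            except:
                print 'Could not open %s' % page
                continue
            soup = BeautifulSoup(c.read())
            indexed.add(page)                  # stand-in for an indexing call
            # Queue every absolute, not-yet-visited link on the page
            for link in soup('a'):
                if 'href' in dict(link.attrs):
                    url = urljoin(page, link['href'])
                    url = url.split('#')[0]    # drop fragment identifiers
                    if url[0:4] == 'http' and url not in indexed:
                        newpages.add(url)
        pages = newpages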
Building the Index
Content-Based Ranking
Content-based ranking calculates a score for each page from the query and the page's own content, then retrieves results in order of those scores. Several metrics can contribute to the score:
Word frequency
The number of times the words in the query appear in the document can help
determine how relevant the document is (a small frequency-scoring sketch follows this list).
Document location
The main subject of a document will probably appear near the beginning of the
document.
Usually, if a page is relevant to the search term, that term will
appear closer to the top of the page, perhaps even in the title. To take advantage of
this, the search engine can score results higher if the query term appears early in the
document.
Word distance
If there are multiple words in the query, they should appear close together in the
document.
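As an illustration of the word-frequency metric, here is a minimal sketch. The rows structure is an assumption for this example: a list of tuples whose first element is a URL id, with one tuple for every place the query words occur together in that page.

def frequencyscore(rows):
    counts = dict([(row[0], 0) for row in rows])
    for row in rows:
        counts[row[0]] += 1      # more occurrences of the query words, higher raw score
    return counts                # scale with the normalization function below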
Normalization
The normalization function will take a dictionary of IDs and scores and return a new
dictionary with the same IDs, but with scores between 0 and 1. Each score is scaled
according to how close it is to the best result, which will always have a score of 1.
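A minimal sketch of such a function, dividing every score by the best one so that the strongest result comes out at 1.0:

def normalizescores(scores):
    maxscore = max(scores.values())
    if maxscore == 0:
        maxscore = 0.00001       # avoid dividing by zero
    return dict([(u, float(s) / maxscore) for (u, s) in scores.items()])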
Using Inbound Links
The scoring metrics discussed so far have all been based on the content of the page.
Although many search engines still work this way, the results can often be improved
by considering information that others have provided about the page, specifically,
who has linked to the page and what they have said about it.
Simple Count
The easiest thing to do with inbound links is to count them on each page and use the
total number of links as a metric for the page. Academic papers are often rated in this
way, with their importance tied to the number of other papers that reference them.
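A sketch of that count, assuming the crawler records each link in a link(fromid, toid) table and reusing the normalizescores sketch above:

def inboundlinkscore(con, urlids):
    counts = {}
    for u in urlids:
        cur = con.execute('select count(*) from link where toid=?', (u,))
        counts[u] = cur.fetchone()[0]
    return normalizescores(counts)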
The PageRank Algorithm
The PageRank algorithm was invented by the founders of Google, and variations on the idea are now used by all the large search engines. This algorithm assigns every
page a score that indicates how important that page is. The importance of the page is
calculated from the importance of all the other pages that link to it and from the
number of links each of the other pages has.
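A minimal, in-memory sketch of the calculation. The links dictionary (mapping each page to the pages it links to) is an assumption for this example; 0.85 is the conventional damping factor and 0.15 the minimum score every page receives.

def calculatepagerank(links, iterations=20):
    pages = set(links.keys())
    for targets in links.values():
        pages.update(targets)
    pr = dict((p, 1.0) for p in pages)       # every page starts with a rank of 1.0

    for i in range(iterations):
        newpr = {}
        for page in pages:
            score = 0.15
            # every page that links here donates a share of its own rank,
            # divided by the number of links it has
            for other, targets in links.items():
                if page in targets and len(targets) > 0:
                    score += 0.85 * pr[other] / len(targets)
            newpr[page] = score
        pr = newpr
    return pr

Since these scores do not depend on the query, they can be computed ahead of time and stored; a pagerank(urlid, score) table of precomputed scores is assumed in the link-text sketch later on.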
Using the Link Text
Another very powerful way to rank searches is to use the text of the links to a page to decide how relevant the page is. Many times you will get better information from what
the links to a page say about it than from the linking page itself, as site developers tend
to include a short description of whatever it is they are linking to.
The link-text scoring code loops through all the words in wordids looking for links
that contain those words. If the target of a link matches one of the search results,
the PageRank of the link's source is added to the destination page's final score.
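A sketch along those lines, assuming linkwords(wordid, linkid) and link(rowid, fromid, toid) tables filled in by the crawler, a pagerank(urlid, score) table of precomputed scores, and the normalizescores sketch from earlier:

def linktextscore(con, urlids, wordids):
    scores = dict([(u, 0) for u in urlids])
    for wordid in wordids:
        cur = con.execute(
            'select link.fromid,link.toid from linkwords,link '
            'where wordid=? and linkwords.linkid=link.rowid', (wordid,))
        for (fromid, toid) in cur:
            if toid in scores:
                # add the PageRank of the page the link comes from
                pr = con.execute('select score from pagerank where urlid=?',
                                 (fromid,)).fetchone()[0]
                scores[toid] += pr
    return normalizescores(scores)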
Learning from Clicks
One of the major advantages of online applications is that they receive constant feedback in the form of user behavior.
To do this, you’re going to build an artificial neural network that you’ll train by
giving it the words in the query, the search results presented to the user, and what
the user decided to click. Once the network has been trained with many different
queries, you can use it to change the ordering of the search results to better reflect
what users actually clicked on in the past.
You might be wondering why you would need a sophisticated technique like a neural
network instead of just remembering the query and counting how many times each
result was clicked. The power of the neural network you’re going to build is that it
can make reasonable guesses about results for queries it has never seen before, based
on their similarity to other queries. Also, neural networks are useful for a wide variety
of applications and will be a great addition to your collective intelligence toolbox.
Setting Up the Database
Since the neural network will have to be trained over time as users perform queries,
you’ll need to store a representation of the network in the database. The database
already has a table of words and URLs, so all that’s needed is a table for the hidden
layer (which will be called hiddennode) and two tables of connections (one from the
word layer to the hidden layer, and one that links the hidden layer to the output
layer).
Create these three tables:
self.con.execute('create table hiddennode(create_key)')
self.con.execute('create table wordhidden(fromid,toid,strength)')
self.con.execute('create table hiddenurl(fromid,toid,strength)')
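These statements need a database connection to run against. A minimal wrapper, assuming SQLite and using searchnet as a placeholder class name:

import sqlite3 as sqlite

class searchnet:
    def __init__(self, dbname):
        self.con = sqlite.connect(dbname)

    def __del__(self):
        self.con.close()

    def maketables(self):
        # one table for the hidden nodes, two for the connection strengths
        self.con.execute('create table hiddennode(create_key)')
        self.con.execute('create table wordhidden(fromid,toid,strength)')
        self.con.execute('create table hiddenurl(fromid,toid,strength)')
        self.con.commit()

Calling maketables() once on a new database file creates the three tables; after that, the network's connection strengths live entirely in the database.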