1.request.raise_for_status()
This is a new function to add to our existing toolbox of ways to check the status of a request. When you call r.raise_for_status(), Requests checks the status of r and, if the status is a client or server error (the 4XX and 5XX status code ranges, respectively), raises an HTTPError exception, which halts the program unless you catch it. If the status is anything else, it does nothing and allows the program to proceed. This is a very convenient way of catching a failed request early, rather than blindly processing its contents, which might cause issues elsewhere in your program.
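As a rough illustration of the behaviour described above, the sketch below wraps raise_for_status() in a try/except block. The helper name describe_status is hypothetical, and the Response objects are built by hand (an assumption made only so the example runs without network access); in a real script r would come from requests.get(url).

```python
import requests


def describe_status(response):
    """Return 'ok' for a usable response, or a short failure description."""
    try:
        # Raises requests.exceptions.HTTPError for 4XX and 5XX status codes,
        # and does nothing for any other status.
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        return f"failed: {err}"
    return "ok"


# Hand-built responses stand in for real requests (illustration only).
fine = requests.Response()
fine.status_code = 200

not_found = requests.Response()
not_found.status_code = 404

print(describe_status(fine))
print(describe_status(not_found))
```

Catching the exception is optional. If you leave out the try/except, a failed request simply stops the program at the call to raise_for_status().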
2.import re
The re library offers a couple of different ways to check a regex pattern against a string of text. In this case, we will be using re.findall(regex, string), which does exactly what the name suggests: it returns a list of all non-overlapping substrings of its string argument which match its regex argument (or, if the regex contains capturing groups, the text captured by those groups). We do not need anything more complicated for our purposes here.
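A small sketch of re.findall in action may help here. The sample text and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped on 2021-03-04; order 7 is pending."

# Without groups, findall returns every non-overlapping match as a string.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['66', '2021', '03', '04', '7']

# With one capturing group, findall returns only the captured part.
order_ids = re.findall(r"order (\d+)", text, flags=re.IGNORECASE)
print(order_ids)  # ['66', '7']
```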
3.Metacharacters are characters that are interpreted in special ways and can be used to match more interesting strings. Here is a list of all of the metacharacters and a very brief description of what they do:
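r“.”— matches any character except a newline
r“^”— matches the beginning of the string (or of each line, with the re.MULTILINE flag)
r“$”— matches the end of the string (or of each line, with the re.MULTILINE flag)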
r“*”— matches 0 or more copies of the preceding symbol
r“+”— matches 1 or more copies of the preceding symbol
r“?”— matches either 0 or 1 copies of the preceding symbol
r“{a,b}”— matches between a and b copies of the preceding symbol (written with no space after the comma)
r“[abc]”— matches any one of the enclosed symbols*
r“\”— escapes the following character**
r“a|b”— matches either a or b
r“( )”— treats the enclosed symbols as a group***
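To make the list above more concrete, here is one short example per metacharacter using Python's re module. The sample strings and variable names are invented for illustration, and the sketch is not meant as an exhaustive reference.

```python
import re

lines = "first line\nsecond line"

# '.' matches any character except a newline.
dot = re.findall(r"c.t", "cat cot c\nt cut")                # ['cat', 'cot', 'cut']

# '^' and '$' anchor to line starts and ends when re.MULTILINE is set.
line_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)  # ['first', 'second']
line_ends = re.findall(r"\w+$", lines, flags=re.MULTILINE)    # ['line', 'line']

# '*', '+' and '?' control how many copies of the preceding symbol match.
star = re.findall(r"ab*", "a ab abbb")                      # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")                      # ['ab', 'abbb']
optional = re.findall(r"colou?r", "color colour")           # ['color', 'colour']

# '{a,b}' matches between a and b copies (no space after the comma).
counted = re.findall(r"\d{2,3}", "1 22 333 4444")           # ['22', '333', '444']

# '[abc]' matches any one of the enclosed symbols.
char_class = re.findall(r"[abc]", "a1b2c3d4")               # ['a', 'b', 'c']

# '\' escapes the following character, so '\.' is a literal dot.
escaped = re.findall(r"\.", "a.b.c")                        # ['.', '.']

# 'a|b' matches either alternative.
either = re.findall(r"cat|dog", "cat dog bird")             # ['cat', 'dog']

# '( )' groups symbols so a quantifier applies to the whole group.
grouped = re.search(r"(ab)+", "xababx").group(0)            # 'abab'
```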
4.An important first step any time that you are trying to write a program to find data on a page is to determine how it looks, not just on the webpage as it appears in your browser, but in the actual HTML that makes up the page.
5.Brief Guide to CSS Selectors
‘*’ — selects all elements. This is analogous to the r“.” regex metacharacter.
‘.class’ — selects all elements which have the given class. For example, the selector ‘.box’ would select the element ‘<div class="box"></div>’ but not the elements ‘<div id="box"></div>’ or ‘<div>box</div>’.
‘element’ — the names of HTML element types (e.g., ‘div’, ‘a’, ‘body’) select all elements of that type. For example, the selector ‘a’ would match all DOM elements which are written with ‘<a></a>’ tags.
‘#id’ — selects the element with the given id. For example, the selector ‘#box2’ will match the element ‘<div id="box2"></div>’ but not the element ‘<div class="box2"></div>’. Note that the description that I just gave says that this selector selects the element (singular) with the specified id. This is because the id of any DOM element must be unique within the page.
‘selector1, selector2’ — the comma acts as an OR. Separating selectors with a comma will cause the resulting selector to match any item which is selected by either selector, i.e., the union of the two selections. For example, the selector ‘.box, a’ would select all elements with the class “box” and all <a> elements.
‘selector1 > selector2’ — selects all elements which are selected by selector2 and are also a direct child of an element selected by selector1. For example, the selector ‘.box > a’ would select the inner element of this element ‘<div class="box"><a href="#"></a></div>’ but not the inner element of this element ‘<a href="#"><div class="box"></div></a>’.
‘selector1 + selector2’ — selects all elements which are selected by selector2 and are also immediately preceded in the DOM by an element selected by selector1. For example, the selector ‘.box + a’ would select the second element in this string ‘<div class="box"></div> <a href="#"></a>’ but not the second element in this string ‘<a href="#"></a> <div class="box"></div>’.
‘selector1 ~ selector2’ — selects all elements which are selected by selector2 and are preceded in the DOM (not necessarily immediately) by an element selected by selector1, where both elements share the same parent. For example, the selector ‘.box ~ a’ would select the ‘a’ element in this string ‘<div><div class="box"></div> <a href="#"></a></div>’ but not the ‘a’ element in this string ‘<div><div class="box"></div></div> <a href="#"></a>’.
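The selectors above can be tried out from Python. The sketch below assumes the third-party BeautifulSoup library (bs4) is installed and uses its select() method. That library choice, the sample HTML, and the helper name texts are assumptions made for illustration, not something this guide prescribes.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration.
html = """
<div id="main">
  <div class="box"><a href="#first">first</a></div>
  <a href="#second">second</a>
  <p>text</p>
  <a href="#third">third</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [element.get_text() for element in soup.select(selector)]


print(len(soup.select("*")))   # every element in the snippet
print(texts("a"))              # all <a> elements
print(texts(".box > a"))       # <a> that is a direct child of .box
print(texts(".box + a"))       # <a> immediately after .box
print(texts(".box ~ a"))       # every later <a> sibling of .box
print([element.name for element in soup.select(".box, p")])  # union of two selectors
print(soup.select_one("#main")["id"])                          # the element with id "main"
```

Note how ‘.box + a’ matches only the link directly after the box, while ‘.box ~ a’ also matches the later link that shares the same parent.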
5. It is also possible to write selectors which reference elements’ attributes. An attribute is something like the ‘href’ in all of the ‘a’ elements in the above examples. This is the syntax for writing attribute selectors:
‘[attribute]’ — selects all elements which have the specified attribute
‘[attribute=val]’ — selects all elements where the specified attribute is equal to ‘val’
‘[attribute~=val]’ — selects all elements whose attribute value is a whitespace-separated list of words containing the word ‘val’
‘[attribute|=val]’ — selects all elements whose attribute value is exactly ‘val’ or begins with ‘val-’
‘[attribute^=val]’ — selects all elements whose attribute value begins with ‘val’
‘[attribute$=val]’ — selects all elements whose attribute value ends with ‘val’
‘[attribute*=val]’ — selects all elements whose attribute value contains the substring ‘val’
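The attribute selectors listed above can be checked the same way. This sketch again assumes the third-party BeautifulSoup library (bs4) and its select() method. The sample links and the helper name labels are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up links for illustration; each link's text is a one-letter label.
html = """
<a href="https://example.com/report.pdf" class="btn btn-primary" lang="en-US">A</a>
<a href="/about" class="link" lang="en">B</a>
<a name="anchor">C</a>
"""
soup = BeautifulSoup(html, "html.parser")


def labels(selector):
    """Return the label text of every element matched by a CSS selector."""
    return [element.get_text() for element in soup.select(selector)]


print(labels("[href]"))             # has an href attribute
print(labels('[lang="en"]'))        # lang is exactly "en"
print(labels('[class~="btn"]'))     # class contains the word "btn"
print(labels('[lang|="en"]'))       # lang is "en" or starts with "en-"
print(labels('[href^="https"]'))    # href begins with "https"
print(labels('[href$=".pdf"]'))     # href ends with ".pdf"
print(labels('[href*="example"]'))  # href contains "example"
```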
6.The biggest practical difference between a library and a framework is that libraries are intended to be used entirely from inside Python, whereas a framework like Scrapy includes tools which need to be used externally to do things like manage your project and run the final product. We will see this when we use Scrapy to set up our project and when we run the resulting spider.
Another difference between a library and a framework is one that you might have guessed from the terminology; a framework is typically much more specific in terms of how you apply it to a problem and how you interact with it. A framework, as the name suggests, is a system for you to build your project into and around whereas, as we have seen, libraries are more like toolboxes which we can take things from as we need them.
7.Web crawling differs from web scraping in that a crawling project places a much larger emphasis on traveling between individual webpages to explore entire websites or even entire corners of the internet.
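The idea of traveling between pages can be sketched without any network access. In the example below, the PAGES dictionary is a hypothetical stand-in for a website (a real crawler would download each page instead), and crawl is an illustrative helper that follows links breadth-first, visiting each page once.

```python
import re
from collections import deque

# Hypothetical in-memory "website": URL path -> HTML of that page.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/">Home</a>',
    "/blog": '<a href="/blog/post-1">Post 1</a> <a href="/about">About</a>',
    "/blog/post-1": '<a href="/blog">Back</a>',
}


def crawl(start):
    """Visit every page reachable from start, breadth-first, once each."""
    seen = {start}
    queue = deque([start])
    visited_order = []
    while queue:
        url = queue.popleft()
        visited_order.append(url)
        html = PAGES.get(url, "")
        # Extract link targets; the capturing group returns just the URL.
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited_order


print(crawl("/"))  # ['/', '/about', '/blog', '/blog/post-1']
```

The seen set is what keeps the crawler from looping forever when pages link back to each other, which is the norm on real websites.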