2 The Web Speaks HTTP

     In this chapter, we introduce one of the core building blocks that makes up the web: the HyperText Transfer Protocol (HTTP), after having provided a brief introduction to computer networks in general. We then introduce the Python requests library, which we’ll use to perform HTTP requests and effectively start retrieving websites with Python. The chapter closes with a section on using parameters in URLs.

2.1 The Magic of Networking

     Nowadays, the web has become so integrated into our day-to-day activities that we rarely consider its complexity. Whenever you surf the web, a whole series of networking protocols is being kicked into gear to set up connections to computers all over the world and retrieve data, all in a matter of seconds. Consider, for instance, the following series of steps that gets executed by your web browser once you navigate to a website, say www.google.com:

  • 1. You enter “www.google.com” into your web browser, which needs to figure out the IP address for this site. IP stands for “Internet Protocol” and forms a core protocol of the Internet, as it enables networks to route and redirect communication packets between connected computers, which are all given an IP address. To communicate with Google’s web server, you need to know its IP address. Since the IP address is basically a number, it would be kind of annoying to remember all these numbers for every website out there. So, just as how you link telephone numbers to names in your phone’s contact book, the web provides a mechanism to translate domain names like “www.google.com” to an IP address.
  • 2. And so, your browser sets off to figure out the correct IP address behind “www.google.com”. To do so, your web browser will use another protocol, called DNS (which stands for Domain Name System) as follows: first, the web browser will inspect its own cache (its “short term memory”) to see whether you’ve recently visited this website in the past. If you have, the browser can reuse the stored address. If not, the browser will ask the underlying operating system (Windows, for example) to see whether it knows the address for www.google.com.
  • 3. If the operating system is also unaware of this domain, the browser will send a DNS request to your router, which is the machine that connects you to the Internet and also — typically — keeps its own DNS cache. If your router is also unaware of the correct address, your browser will start sending a number of data packets to known DNS servers, for example, to the DNS server maintained by your Internet Service Provider (ISP) — for which the Internet-facing IP address (the WAN IP) is known and stored in your router. The DNS server will then reply with a response basically indicating that “www.google.com” is mapped to the IP address “172.217.17.68”. Note that even your ISP’s DNS server might have to ask other DNS servers (located higher in the DNS hierarchy) in case it doesn’t have the record at hand. (A short Python sketch following this list shows how to perform such a lookup yourself.)
  • 4. All of this was done just to figure out the IP address of www.google.com. Your browser can now establish a connection to 172.217.17.68, Google’s web server. A number of protocols — a protocol is a standard agreement regarding what messages between communicating parties should look like — are combined here (wrapped around each other, if you will) to construct a complex message. At the outermost part of this “onion,” we find the IEEE 802.3 (Ethernet) protocol, which is used to communicate with machines on the same network.
    Since we’re not communicating on the same network, the Internet Protocol, IP, is used to embed another message indicating that we wish to contact the server at address 172.217.17.68. Inside this, we find another protocol, called TCP (Transmission Control Protocol), which provides a general, reliable means to deliver network messages, as it includes functionality for error checking and splitting messages up in smaller packages, thereby ensuring that these packets are delivered in the right order. TCP will also resend packets when they are lost in transmission. Finally, inside the TCP message, we find another message, formatted according to the HTTP protocol (HyperText Transfer Protocol), which is the actual protocol used to request and receive web pages. Basically, the HTTP message here states a request from our web browser: “Can I get your index page, please?”
  • 5. Google’s web server now sends back an HTTP reply, containing the contents of the page we want to visit. In most cases, this textual content is formatted using HTML, a markup language we’ll take a closer look at later on. From this (oftentimes large) bunch of text, our web browser can set off to render the actual page, that is, making sure that everything appears neatly on screen as instructed by the HTML content. Note that a web page will oftentimes contain pieces of content for which the web browser will — behind the scenes — initiate new HTTP requests. In case the received page instructs the browser to show an image, for example, the browser will fire off another HTTP request to get the contents of the image (which will then not look like HTML-formatted text but simply like raw, binary data). As such, rendering just one web page might involve a great deal of HTTP requests. Luckily, modern browsers are smart and will start rendering the page as soon as information is coming in, showing images and other visuals as they are retrieved. In addition, browsers will try to send out multiple requests in parallel if possible to speed up this process as well.
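
     To make steps 2 and 3 a bit more tangible, the following is a minimal sketch showing how you can perform such a domain name lookup yourself with Python’s built-in socket module (the address you get back will most likely differ from the one shown above):

import socket

# Ask the operating system's resolver (which may in turn query DNS servers)
# for the IP address behind a domain name.
ip_address = socket.gethostbyname('www.google.com')

print(ip_address)  # e.g. 172.217.17.68 -- the exact address will vary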

     With so many protocols, requests, and talking between machines going on, it is nothing short of amazing that you are able to view a simple web page in less than a second. To standardize the large amount of protocols that form the web, the International Organization for Standardization (ISO) maintains the Open Systems Interconnection (OSI) model, which organizes computer communication into seven layers:

  • • Layer 1: Physical Layer: Includes the Ethernet protocol, but also USB, Bluetooth, and other radio protocols.
  • • Layer 2: Data Link Layer: Includes the Ethernet protocol.
  • • Layer 3: Network Layer: Includes IP (Internet Protocol).
  • • Layer 4: Transport Layer: TCP, but also protocols such as UDP, which do not offer the advanced error checking and recovery mechanisms of TCP for the sake of speed.
  • • Layer 5: Session Layer: Includes protocols for opening/closing and managing sessions.
  • • Layer 6: Presentation Layer: Includes protocols to format and translate data.
  • • Layer 7: Application Layer: HTTP and DNS, for instance.

     Not all network communications need to use protocols from all these layers. To request a web page, for instance, layers 1 (physical), 2 (Ethernet), 3 (IP), 4 (TCP), and 7 (HTTP) are involved, but the layers are constructed so that each protocol found at a higher level can be contained inside the message of a lower-layer protocol. When you request a secure web page, for instance, the HTTP message (layer 7) will be encoded in an encrypted message (layer 6) (this is what happens if you surf to an “https”-address). The lower the layer you aim for when programming networked applications, the more functionality and complexity you need to deal with. Luckily for us web scrapers, we’re interested in the topmost layer, that is, HTTP, the protocol used to request and receive web pages. That means that we can leave all complexities regarding TCP, IP, Ethernet, and even resolving domain names with DNS up to the Python libraries we use, and the underlying operating system.

2.2 The HyperText Transfer Protocol: HTTP

     We’ve now seen how your web browser communicates with a server on the World Wide Web. The core component in the exchange of messages consists of a HyperText Transfer Protocol (HTTP) request message to a web server, followed by an HTTP response (also oftentimes called an HTTP reply), which can be rendered by the browser. Since all of our web scraping will build upon HTTP, we do need to take a closer look at HTTP messages to learn what they look like.

     HTTP is, in fact, a rather simple networking protocol. It is text based, which at least makes its messages somewhat readable to end users (compared to raw binary messages that have no textual structure at all), and it follows a simple request-reply-based communication scheme. That is, contacting a web server and receiving a reply simply involves two HTTP messages: a request and a reply. In case your browser wants to download or fetch additional resources (such as images), this will simply entail additional request-reply messages being sent.

     Keep Me Alive     In the simplest case, every request-reply cycle in HTTP involves setting up a fresh new underlying TCP connection as well. For heavy websites, setting up many TCP connections and tearing them down in quick succession creates a lot of overhead, so HTTP version 1.1 allows us to keep the TCP connection “alive” and reuse it for consecutive request-reply HTTP messages. HTTP version 2.0 even allows us to “multiplex” (a fancy word for “mixing messages”) in the same connection, for example, to send multiple concurrent requests. Luckily, we don’t need to concern ourselves much with these details while working with Python, as requests, the library we’ll use, takes care of this for us automatically behind the scenes.
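     As a small preview of the requests library introduced later in this chapter, the following minimal sketch uses a requests.Session object, which reuses the underlying TCP connection (“keep-alive”) for consecutive requests to the same host instead of reconnecting every time (the URL is just an illustrative example):

import requests

# A Session reuses the underlying TCP connection for consecutive requests
# to the same host, instead of setting up a fresh connection each time.
with requests.Session() as session:
    for _ in range(3):
        r = session.get('http://www.example.com')
        print(r.status_code)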

      Let us now take a look at what an HTTP request and reply look like. As we recall, a client (the web browser, in most cases) and web server will communicate by sending plain text messages. The client sends requests to the server and the server sends responses, or replies.

A request message consists of the following:

  • • A request line;
  • • A number of request headers, each on their own line;
  • • An empty line;
  • • An optional message body, which can also take up multiple lines. Each line in an HTTP message must end with <CR><LF> (the ASCII characters 0D and 0A).

The empty line is simply <CR><LF> with no other additional white space.

     New Lines     <CR> and <LF> are two special characters used to indicate that a new line should be started. You don’t see them appearing as such, but when you type out a plain text document in, say, Notepad, every time you press enter, these two characters (<CR><LF>) will be put inside of the contents of the document to represent “that a new line appears here.” An annoying aspect of computing is that operating systems do not always agree on which character to use to indicate a new line. Linux programs tend to use <LF> (the “line feed” character), whereas older versions of MacOS used <CR> (the “carriage return” character). Windows uses both <CR> and <LF> to indicate a new line, which was also adopted by the HTTP standard. Don’t worry too much about this, as the Python requests library will take care of correctly formatting the HTTP messages for us.

     The following code fragment shows a full HTTP request message as executed by a web browser (we don’t show the “<CR><LF>” after each line, except for the last, blank line):

GET / HTTP/1.1
Host: example.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Referer: https://www.google.com/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8,nl;q=0.6
<CR><LF>

     Let’s take a closer look at this message. “GET / HTTP/1.1” is the request line. It contains the HTTP “verb” or “method” we want to execute (“GET” in the example above), the URL we want to retrieve (“/”), and the HTTP version we understand (“HTTP/1.1”). Don’t worry too much about the “GET” verb. HTTP has a number of verbs (that we’ll discuss later on). For now, it is important to know that “GET” means this: “get the contents of this URL for me.” Every time you enter a URL in your address bar and press enter, your browser will perform a GET request.

     Next up are the request headers, each on their own line. In this example, we already have quite a few of them. Note that each header includes a name (“Host,” for instance), followed by a colon (“:”) and the actual value of the header (“example.com”). Browsers are very chatty in terms of what they like to include in their headers, and Chrome (the web browser used here) is no exception.

     The HTTP standard includes some headers that are standardized and which will be utilized by proper web browsers, though you are free to include additional headers as well. “Host,” for instance, is a standardized and mandatory header in HTTP 1.1 and higher. The reason why it was not around in HTTP 1.0 (the first version) is simple: in those days, each web server (with its IP address) was responsible for serving one particular website. If we hence sent “GET / HTTP/1.1” to a web server responsible for “example.com”, the server knew which page to fetch and return. However, it didn’t take long for the following bright idea to appear: Why not serve multiple websites from the same server, with the same IP address? The same server responsible for “example.com” might also be the one serving pages belonging to “example.org”, for instance. However, we then need a way to tell the server which domain name we’d like to retrieve a page from. Including the domain name in the request line itself, like “GET example.org/ HTTP/1.1”, might have been a solid idea, though this would break backward compatibility with earlier web servers, which expect a URL without a domain name in the request line. A solution was then offered in the form of a mandatory “Host” header, indicating from which domain name the server should retrieve the page.

Wrong Host     Don’t try to be too clever and send a request to a web server responsible for “example.com” with the “Host” header changed to read “Host: something-entirely-different.com”. Proper web servers will complain and simply send back an error page saying: “Hey, I’m not the server hosting that domain.” This being said, security issues have been identified on websites where it is possible to confuse and misdirect them by spoofing this header.

     Apart from the mandatory “Host” header, we also see a number of other headers appearing that form a set of “standardized request headers,” which are not mandatory, though nevertheless included by all modern web browsers. “Connection: keep-alive,” for instance, signposts to the server that it should keep the connection open for subsequent requests if it can. The “User-Agent” header contains a large text value through which the browser happily informs the server what it is (Chrome) and which version it is running.

     The User-Agent Mess     Well… you’ll note that the “User-Agent” header contains “Chrome,” but also a lot of additional, seemingly unrelated text such as “Mozilla,” “AppleWebKit,” and so on. Is Chrome masquerading itself and posing as other browsers? In a way, it is, though it is not the only browser that does so. The problem is this: when the “User-Agent” header came along and browsers started sending their names and versions, some website owners thought it was a good idea to check this header and reply with different versions of a page depending on who’s asking, for instance to tell users that “Netscape 4.0” is not supported by this server. The routines responsible for these checks were often implemented in a haphazard way, thereby mistakenly sending users off when they’re running some unknown browser, or failing to correctly check the browser’s version. Browser vendors hence had no choice over the years but to get creative and include lots of other text fields in this User-Agent header. Basically, our browser is saying “I’m Chrome, but I’m also compatible with all these other browsers, so just let me through please.”

The structure of a User-Agent string for Gecko-based browsers looks as follows:

Mozilla/MozillaVersion (Platform; Encryption; OS-or-CPU; Language; PrereleaseVersion) Gecko/GeckoVersion ApplicationProduct/ApplicationProductVersion

The pieces of this string are:

  • MozillaVersion (required): The version of Mozilla.
  • Platform (required): The platform on which the browser is running. Possible values include Windows, Mac, and X11 (for Unix X-windows systems).
  • Encryption (required): Encryption capabilities: U for 128-bit, I for 40-bit, or N for no encryption.
  • OS-or-CPU (required): The operating system the browser is being run on or the processor type of the computer running the browser. If the platform is Windows, this is the version of Windows (such as WinNT, Win95, and so on). If the platform is Macintosh, this is the CPU (either 68k, PPC for PowerPC, or MacIntel). If the platform is X11, this is the Unix operating-system name as obtained by the Unix command uname -sm.
  • Language (required): The language that the browser was created for use in.
  • PrereleaseVersion (optional): Originally intended as the prerelease version number for Mozilla, it now indicates the version number of the Gecko rendering engine.
  • GeckoVersion (required): The version of the Gecko rendering engine, represented by a date in the format yyyymmdd.
  • ApplicationProduct (optional): The name of the product using Gecko. This may be Netscape, Firefox, and so on.
  • ApplicationProductVersion (optional): The version of the ApplicationProduct; this is separate from the MozillaVersion and the GeckoVersion.

Chrome’s User-Agent string follows a similar template (see https://developer.chrome.com/docs/multidevice/user-agent/):

Mozilla/5.0 (OS-or-CPU; Platform; Encryption) AppleWebKit/AppleWebKitVersion (KHTML, like Gecko) Chrome/ChromeVersion Safari/SafariVersion

A concrete example:

User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36

Here, “Mozilla/5.0” is a historical token (it originally indicated Netscape Navigator 5.0); nowadays virtually every browser includes it for compatibility reasons.

     The “Accept” header tells the server which forms of content the browser prefers to get back, and “Accept-Encoding” tells the server that the browser is also able to get back compressed content. The “Referer” header (a deliberate misspelling) tells the server from which page the browser is coming (in this case, a link was clicked on “google.com” sending the browser to “example.com”).

A Polite Request     Even though your web browser will try to behave politely and, for instance, tell the web server which forms of content it accepts, there is no guarantee whatsoever that a web server will actually look at these headers or follow up on them. A browser might indicate in its “Accept” header that it understands “webp” images, but the web server can just ignore this request and send back images as “jpg” or “png” anyway. Consider these request headers as polite requests, though, nothing more.

     Finally, our request message ends with a blank <CR><LF> line, and has no message body whatsoever. These are not included in GET requests, but we’ll see HTTP messages later on where this message body will come into play.

If all goes well, the web server will process our request and send back an HTTP reply. These look very similar to HTTP requests and contain:

  • • A status line that includes the status code and a status message;
  • • A number of response headers, again each on their own line;
  • • An empty line;
  • • An optional message body.

The following fragment shows a full HTTP reply, including a short HTML message body:

HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html;charset=utf-8
Date: Mon, 28 Aug 2017 10:57:42 GMT
Server: Apache v1.3
Vary: Accept-Encoding
Transfer-Encoding: chunked
<CR><LF>
<html>
<body>Welcome to My Web Page</body>
</html>

     Again, let’s take a look at the HTTP reply line by line. The first line indicates the status result of the request. It opens by listing the HTTP version the server understands (“HTTP/1.1”), followed by a status code (“200”) and a status message (“OK”). If all goes well, the status will be 200. There are a number of agreed-upon HTTP status codes that we’ll take a closer look at later on, but you’re probably also familiar with the 404 status code, indicating that the URL listed in the request could not be retrieved, that is, it was “not found” on the server.

    Next up are — again — a number of headers, now coming from the server. Just like web browsers, servers can be quite chatty in terms of what they provide, and can include as many headers as they like. Here, the server includes its current date and version (“Apache v1.3”) in its headers. Another important header here is “Content-Type,” as it will provide browsers with information regarding what the content included in the reply looks like. Here, it is HTML text, but it might also be binary image data, movie data, and so on.

     Following the headers is a blank <CR><LF> line, and an optional message body, containing the actual content of the reply. Here, the content is a bunch of HTML text containing “Welcome to My Web Page.” It is this HTML content that will then be parsed by your web browser and visualized on the screen. Again, the message body is optional, but since we expect most requests to actually come back with some content, a message body will be present in almost all cases.

Message Bodies     Even when the status code of the reply is 404, for instance, many websites will include a message body to provide the user with a nice looking page indicating that — sorry — this page could not be found. If the server leaves it out, the web browser will just show its default “Page not found” page instead. There are some other cases where an HTTP reply does not include a message body, which we’ll touch upon later on.
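     Before we move on to Python libraries that handle HTTP for us, it can be illustrative to see that an HTTP request-reply cycle really is nothing more than plain text flowing over a TCP connection. The following is a minimal sketch using Python’s built-in socket module; it assumes the server at example.com is reachable, and note that the raw reply you get back may use chunked transfer encoding:

import socket

# Open a TCP connection to the web server (port 80 is the default for HTTP).
connection = socket.create_connection(('example.com', 80))

# An HTTP request is plain text; every line ends with <CR><LF> ('\r\n')
# and the message is terminated by an empty line.
request = (
    'GET / HTTP/1.1\r\n'
    'Host: example.com\r\n'
    'Connection: close\r\n'
    '\r\n'
)
connection.sendall(request.encode('ascii'))

# Read the reply: status line, headers, blank line, and message body.
reply = b''
while True:
    chunk = connection.recv(4096)
    if not chunk:
        break
    reply += chunk
connection.close()

print(reply.decode('utf-8', errors='replace'))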

2.3 HTTP in Python: The Requests Library

     We’ve now seen the basics regarding HTTP, so it is time we get our hands dirty with some Python code. Recall the main purpose of web scraping: to retrieve data from the web in an automated manner. Basically, we’re throwing out our web browser and we’re going to surf the web using a Python program instead. This means that our Python program will need to be able to speak and understand HTTP.

     Of course, we could try to program this ourselves on top of the standard networking functionality already built into Python (or other languages, for that matter), making sure that we neatly format HTTP request messages and are able to parse the incoming responses. However, we’re not interested in reinventing the wheel, and there are many Python libraries out there already that make this task a lot more pleasant, so that we can focus on what we’re actually trying to accomplish.

     In fact, there are quite a few libraries in the Python ecosystem that can take care of HTTP for us. To name a few:

  • • Python 3 comes with a built-in module called “urllib,” which can deal with all things HTTP (see https://docs.python.org/3/library/urllib.html). The module got heavily revised compared to its counterpart in Python 2, where HTTP functionality was split up over both “urllib” and “urllib2” and was somewhat cumbersome to work with.
  • • “httplib2” (see https://github.com/httplib2/httplib2): a small, fast HTTP client library. Originally developed by Googler Joe Gregorio, and now community supported.
  • • “urllib3” (see https://urllib3.readthedocs.io/): a powerful HTTP client for Python, used by the requests library below.
  • • “requests” (see http://docs.python-requests.org/): an elegant and simple HTTP library for Python, built “for human beings.”
  • • “grequests” (see https://pypi.python.org/pypi/grequests): which extends requests to deal with asynchronous, concurrent HTTP requests.
  • • “aiohttp” (see http://aiohttp.readthedocs.io/): another library focusing on asynchronous HTTP.

     We’ll use the “requests” library to deal with HTTP. The reason why is simple: whereas “urllib” provides solid HTTP functionality (especially compared with the situation in Python 2), using it often involves lots of boilerplate code, making the module less pleasant to use and not very elegant to read. Compared with “urllib,” “urllib3” (not part of the standard Python modules) extends the Python ecosystem regarding HTTP with some advanced features, but it also doesn’t really focus that much on being elegant or concise. That’s where “requests” comes in. This library builds on top of “urllib3,” but it allows you to tackle the majority of HTTP use cases in code that is short, pretty, and easy to use. Both “grequests” and “aiohttp” are more recent libraries that aim to make HTTP with Python more asynchronous. This becomes especially important for very heavy-duty applications where you’d have to make lots of HTTP requests as quickly as possible. We’ll stick with “requests” in what follows, as asynchronous programming is a rather challenging topic on its own, and we’ll discuss more traditional ways of speeding up your web scraping programs in a robust manner. It should not be too hard to move on from “requests” to “grequests” or “aiohttp” (or other libraries) should you wish to do so later on.
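     To get a feel for the difference in boilerplate, the following minimal sketch fetches the same page once with the built-in urllib module and once with requests (the URL is just an illustrative example; with urllib you get raw bytes back and handle decoding yourself, whereas requests decodes the response text for you):

from urllib.request import urlopen

import requests

url = 'http://www.example.com'

# Using the built-in urllib module: read raw bytes and decode them manually.
with urlopen(url) as response:
    html_urllib = response.read().decode('utf-8')

# Using requests: the response text is decoded for you.
html_requests = requests.get(url).text

print(html_urllib[:60])
print(html_requests[:60])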

     Installing requests can be done easily through pip (refer back to section 1.2.1 if you still need to set up Python 3 and pip). Execute the following in a command-line window (the “-U” argument will make sure to update an existing version of requests should there already be one):

pip install -U requests

 

 Next, create a Python file (“firstexample.py” is a good name), and enter the following:

We use the requests.get method to perform an “HTTP GET” request to the provided URL.

import requests

url = 'http://www.webscrapingfordatascience.com/basichttp/'
r = requests.get( url )

The requests.get method returns a requests.Response Python object containing lots of information regarding the HTTP reply that was retrieved. Again, requests takes care of parsing the HTTP reply so that you can immediately start working with it. 

The HTTP response content:

print( r.text )

     r.text contains the HTTP response content body in textual form. Here, the HTTP response body simply contained the content “Hello from the web!”

If all goes well, you should see the following line appear when executing this script:

Hello from the web!

A More Generic Request     Since we’ll be working with HTTP GET requests only (for now), the requests.get method will form a cornerstone of the upcoming examples. Later on, we’ll also deal with other types of HTTP requests, such as POST. Each of these comes with a corresponding method in requests, for example, requests.post. There’s also a generic request method that looks like this: requests.request('GET', url). This is a bit longer to write, but might come in handy in cases where you don’t know beforehand which type of HTTP request (GET, or something else) you’re going to make.

Let us expand upon this example a bit further to see what’s going on under the hood:

Which HTTP status code did we get back from the server?

print( r.status_code )

What is the textual status code?

print(r.reason)

     By using the status_code and reason attributes of a requests.Response object, we can retrieve the HTTP status code and associated text message we got back from the server. Here, a status code and message of “200 OK” indicate that everything went well.
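     When scraping, you’ll often want to verify that you actually got the reply you expected. A minimal sketch of two common ways to do this: checking status_code explicitly, or calling raise_for_status(), which raises an exception for 4xx and 5xx replies:

import requests

r = requests.get('http://www.webscrapingfordatascience.com/basichttp/')

# Check the status code explicitly...
if r.status_code == 200:
    print('Everything went well:', r.reason)

# ...or let requests raise an HTTPError for 4xx and 5xx replies.
r.raise_for_status()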

What were the HTTP response headers?

print(r.headers)

     The headers attribute of the requests.Response object returns a dictionary of the headers the server included in its HTTP reply. Again: servers can be pretty chatty. This server reports its date and server version, and also provides the “Content-Type” header.
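     This headers dictionary behaves like a normal Python dictionary, except that its keys are case insensitive, so you can look up individual headers in whichever capitalization you like. A minimal sketch:

import requests

r = requests.get('http://www.webscrapingfordatascience.com/basichttp/')

# r.headers behaves like a dictionary with case-insensitive keys,
# so both of these lookups return the same "Content-Type" value.
print(r.headers['Content-Type'])
print(r.headers.get('content-type'))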

The request information is saved as a Python object in r.request:

print( r.request )


     To get information regarding the HTTP request that was fired off, you can access the request attribute of a requests.Response object. This attribute itself is a requests.PreparedRequest object, containing information about the HTTP request that was prepared.

 What were the HTTP request headers?

print( r.request.headers )

     Since an HTTP request message also includes headers, we can access the headers attribute for this object as well to get a dictionary representing the headers that were included by requests. Note that requests politely reports its “User-Agent” by default. In addition, requests can take care of compressed pages automatically as well, so it includes an “Accept-Encoding” header ('gzip, deflate') to signpost this. Finally, it includes an “Accept” header to indicate that “any format you have can be sent back,” and a “Connection: keep-alive” header to indicate that it can deal with keep-alive connections as well. Later on, however, we’ll see cases where we need to override requests’ default request header behavior.
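     As a quick preview of that (we’ll revisit it later), the following minimal sketch passes a dictionary to the headers argument of requests.get to override a default request header; the User-Agent value shown is just an illustrative example:

import requests

url = 'http://www.webscrapingfordatascience.com/basichttp/'

# Pass a dictionary to the headers argument to override or add request
# headers; the User-Agent value below is just an example.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get(url, headers=headers)

print(r.request.headers['User-Agent'])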

2.4 Query Strings: URLs with Parameters

     There’s one more thing we need to discuss regarding the basic working of HTTP: URL parameters. Try adapting the code example above in order to scrape the URL http://www.webscrapingfordatascience.com/paramhttp/. You should get the following content:

Please provide a "query" parameter 

Try opening this page in your web browser to verify that you get the same result. Now
try navigating to the page http://www.webscrapingfordatascience.com/paramhttp/?query=test. What do you see? 

     The optional “?…” part in URLs is called the “query string,” and it is meant to contain data that does not fit within a URL’s normal hierarchical path structure. You’ve probably encountered this sort of URL many times when surfing the web; search engines, for instance, typically put your search term in a “q” or “query” URL parameter.

     Web servers are smart pieces of software. When a server receives an HTTP request for such URLs, it may run a program that uses the parameters included in the query string — the “URL parameters” — to render different content. Compare http://www.webscrapingfordatascience.com/paramhttp/?query=test with http://www.webscrapingfordatascience.com/paramhttp/?query=anothertest, for instance. Even for this simple page, you see how the response dynamically incorporates the parameter data that you provided in the URL.

 Query strings in URLs should adhere to the following conventions (a short sketch using Python’s urllib.parse module follows below):

  • A query string comes at the end of a URL, starting with a single question mark, “?”.
  • Parameters are provided as key-value pairs and separated by an ampersand, “&”.
  • The key and value are separated using an equals sign, “=”.
  • Since some characters cannot be part of a URL or have a special meaning (the characters “/”, “?”, “&”, and “=”, for instance), URL “encoding” needs to be applied to properly format such characters when using them inside of a URL. Try this out using the URL http://www.webscrapingfordatascience.com/paramhttp/?query=another%20test%3F%26, which sends “another test?&” as the value for the “query” parameter to the server in an encoded form.
  • Other exact semantics are not standardized. In general, the order in which the URL parameters are specified is not taken into account by web servers, though some might. Many web servers will also be able to deal with and use URL parameters without a value, for example, http://www.example.com/?noparam=&anotherparam. Since the full URL is included in the request line of an HTTP request, the web server can decide how to parse and deal with these.

URL Rewriting     This latter remark also highlights another important aspect regarding URL parameters: even though they are somewhat standardized, they’re not treated as being a “special” part of a URL, which is just sent as a plain text line in an HTTP request anyway. Most web servers will take care to parse them on their end in order to use their information while rendering a page (or even ignore them when they’re unused — try the URL http://www.webscrapingfordatascience.com/paramhttp/?query=test&other=ignored, for instance), but in recent years, the usage of URL parameters is being avoided somewhat. Instead, most web frameworks allow us to define “nice looking” URLs that just include the parameters in the path of a URL, for example, “/product/302/” instead of “products.html?p=302”. The former looks nicer when looking at the URL as a human, and search engine optimization (SEO) people will tell you that search engines prefer such URLs as well. On the server side of things, any incoming URL can hence be parsed at will, taking pieces from it and “rewriting” it, as it is called, so some parts might end up being used as input while preparing a reply. For us web scrapers, this basically means that even though you don’t see a query string in a URL, there might still be dynamic parts in the URL to which the server might respond in different ways.
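     To illustrate the conventions above, here is a minimal sketch that builds and parses a query string with Python’s standard urllib.parse module; the parameter names are just examples:

from urllib.parse import urlencode, parse_qs

# Build a query string from key-value pairs: keys and values are separated
# by "=", pairs by "&", and special characters are percent-encoded.
query_string = urlencode({'query': 'another test?&', 'page': '2'})
print(query_string)      # query=another+test%3F%26&page=2

# Going the other way: parse a query string back into a dictionary.
print(parse_qs(query_string))   # {'query': ['another test?&'], 'page': ['2']}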

Let’s take a look at how to deal with URL parameters in requests. The easiest way to deal with these is to include them simply in the URL itself: 

import requests

url = 'http://www.webscrapingfordatascience.com/paramhttp/?query=test'
r = requests.get(url)

print(r.text)

In some circumstances, requests will try to help you out and encode some characters for you:

import requests

url = 'http://www.webscrapingfordatascience.com/paramhttp/?query=a query with spaces'
r = requests.get(url)

The parameter will be encoded as 'a%20query%20with%20spaces'. You can verify this by looking at the prepared request URL:

print(r.request.url)

     However, sometimes the URL is too ambiguous for requests to make sense of it:

import requests

url = 'http://www.webscrapingfordatascience.com/paramhttp/?query=complex?&'
r = requests.get(url)

In this case, the parameter will not be encoded.

     In this case, requests is unsure whether you meant “?&” to belong to the actual URL as is or whether you wanted to encode it. Hence, requests will do nothing and just request the URL as is. On the server-side, this particular web server is able to derive that the second question mark (“?”) should be part of the URL parameter (and should have been properly encoded, but it won’t complain), though the ampersand “&” is too ambiguous in this case. Here, the web server assumes that it is a normal separator and not part of the URL parameter value.

Solution:

     So how, then, can we properly resolve this issue? A first method is to use the urllib.parse functions quote and quote_plus. The former is meant to encode special characters in the path section of URLs, using percent “%XX” encoding (spaces included). The latter does the same, but replaces spaces by plus signs, and it is generally used to encode query strings:

import requests
from urllib.parse import quote, quote_plus

raw_string = 'a query with /, spaces and?&'

print( quote(raw_string) )

     The quote function applies percent encoding, but leaves the slash (“/”) intact (as its default setting, at least) as this function is meant to be used on URL paths.

print( quote_plus(raw_string) )

 The quote_plus function does apply a similar encoding, but uses a plus sign (“+”) to encode spaces and will also encode slashes.

      As long as we make sure that our query parameter does not use slashes, both encoding approaches are valid to be used to encode query strings. In case our query string does include a slash, and if we do want to use quote, we can simply override its safe argument as done below:

import requests
from urllib.parse import quote, quote_plus

raw_string = 'a query with /, spaces and?&'
url = 'http://www.webscrapingfordatascience.com/paramhttp/?query='

print( '\nUsing quote:' )
# Nothing is safe, not even '/' characters, so encode everything

r = requests.get( url + quote(raw_string, safe='') )
print( r.url )
print( r.text )

Using quote:
http://www.webscrapingfordatascience.com/paramhttp/?query=a%20query%20with%20%2F%2C%20spaces%20and%3F%26
I don't have any information on "a query with /, spaces and?&"

print( '\nUsing quote_plus:' )
r = requests.get( url + quote_plus(raw_string) )
print( r.url )
print( r.text )

Using quote_plus:
http://www.webscrapingfordatascience.com/paramhttp/?query=a+query+with+%2F%2C+spaces+and%3F%26
I don't have any information on "a query with /, spaces and?&"

     All this encoding juggling can quickly lead to a headache. Wasn’t requests supposed to make our life easy and deal with this for us? Not to worry, as we can simply rewrite the example above using requests only as follows:

import requests

url = 'http://www.webscrapingfordatascience.com/paramhttp/'
parameters = {
                'query': 'a query with /, spaces and?&'
             }
r = requests.get(url, params=parameters)

print(r.url)
print(r.text)

http://www.webscrapingfordatascience.com/paramhttp/?query=a+query+with+%2F%2C+spaces+and%3F%26
I don't have any information on "a query with /, spaces and?&"

     Note the usage of the params argument in the requests.get method: you can simply pass a Python dictionary with your non-encoded URL parameters, and requests will take care of encoding them for you.

Empty and Ordered Parameters     

     Empty parameters, for example, as in params={'query': ''}, will end up in the URL with an equals sign included, that is, “?query=”.
     If you want, you can also pass a list to params with every element being a tuple or list itself having two elements representing the key and value per parameter respectively, in which case the order of the list will be respected.
     You can also pass an OrderedDict object (a built-in object provided by the “collections” module in Python 3) that will retain the ordering.
     Finally, you can also pass a string representing your query string part. In this case, requests will prepend the question mark (“?”) for you, but will — once again — not be able to provide smart URL encoding, so you are responsible for making sure your query string is encoded properly. Although this is not frequently used, it can come in handy in cases where the web server expects an “?param” without an equals sign at the end, for instance — something that rarely occurs in practice, but can happen.
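     A minimal sketch illustrating two of these behaviors, using the same example page as before: a list of (key, value) tuples keeps the parameters in the given order, and an empty value ends up as “key=” in the query string:

import requests

url = 'http://www.webscrapingfordatascience.com/paramhttp/'

# A list of (key, value) tuples preserves the given parameter order;
# an empty string value ends up as "empty=" in the query string.
params = [('query', 'test'), ('empty', '')]
r = requests.get(url, params=params)

print(r.url)
# http://www.webscrapingfordatascience.com/paramhttp/?query=test&empty=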

Silencing requests Completely

     Even when passing a string to params, or including the full URL in the requests.get method, requests will still try, as we have seen, to help out a little. For instance, writing:

requests.get('http://www.example.com/?spaces |pipe')

will make you end up with “?spaces%20%7Cpipe” as the query string in the request URL, with the space and pipe (“|”) characters encoded for you. In rare situations, a very picky web server might nevertheless expect URLs to come in unencoded. Again, cases such as these are extremely rare, but we have encountered situations in the wild where this happens. In this case, you will need to override requests as follows:

import requests
from urllib.parse import unquote

class NonEncodedSession(requests.Session):
    # Override the default send method
    def send(self, *a, **kw):
        # Revert the encoding which was applied
        a[0].url = unquote(a[0].url)
        return requests.Session.send(self, *a, **kw)
    
my_requests = NonEncodedSession()
url = 'http://www.example.com/?spaces |pipe'
r = my_requests.get(url)

print(r.url)
# Will show: http://www.example.com/?spaces |pipe

     As a final exercise, head over to http://www.webscrapingfordatascience.com/calchttp/. Play around with the “a,” “b,” and “op” URL parameters. You should be able to work out what the following code does:

import requests

def calc(a, b, op):
    url = 'http://www.webscrapingfordatascience.com/calchttp/'
    params = {'a': a, 'b': b, 'op': op}
    
    r = requests.get(url, params=params)
    return r.text

print(calc(4, 6, '*'))
print(calc(4, 6, '/'))

     Based on what we’ve seen above, you’ll probably feel itchy to try out what you’ve learned using a real-life website. However, there is another hurdle we need to pass before being web ready. What happens, for instance, when you run the following:

import requests

url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
    
print(r.text)

      As you can see, the response body captured by r.text now spits out a slew of confusing-looking text. This is HTML-formatted text, and although the content we’re looking for is buried somewhere inside this soup, we’ll need to learn about a proper way to get out the information we want from there. That’s exactly what we’ll do in the next chapter.


 

Wikipedia Versioning

     We’re using the “oldid” URL parameter here such that we obtain a specific version of the “List of Game of Thrones episodes” page, to make sure that our subsequent examples will keep working. By the way, here you can see “URL rewriting” in action: both https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes and https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes lead to the exact same page. The difference is that the latter uses URL parameters and the former does not, though Wikipedia’s web server is clever enough to route URLs to their proper “page.” Also, you might note that we’re not using the params argument here. We could, though neither the “title” nor “oldid” parameters require encoding here, so we can just stick them in the URL itself to keep the rest of the code a bit shorter.

 

The Fragment Identifier

     Apart from the query string, there is in fact another optional part of the URL that you might have encountered before: the fragment identifier, or “hash,” as it is sometimes called. It is prepended by a hash mark (“#”) and comes at the very end of a URL, even after the query string, for instance, as in “http://www.example.org/about.htm?p=8#contact”. This part of the URL is meant to identify a portion of the document corresponding to the URL. For instance, a web page can include a link including a fragment identifier that, if you click on it, immediately scrolls your view to the corresponding part of the page. However, the fragment identifier functions differently than the rest of the URL, as it is processed exclusively by the web browser with no participation at all from the web server. In fact, proper web browsers should not even include the fragment identifier in their HTTP requests when they fetch a resource from a web server. Instead, the browser waits until the web server has sent its reply, and it will then use the fragment identifier to scroll to the correct part of the page. Most web servers will simply ignore a fragment identifier if you would include it in a request URL, although some might be programmed to take them into account as well. Again: this is rather rare, as the content provided by such a server would not be viewable by most web browsers, as they leave out the fragment identifier part in their requests, though the web is full of interesting edge cases.
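     You can inspect the different parts of such a URL, including the fragment, with Python’s standard urllib.parse module. A minimal sketch using the example URL above:

from urllib.parse import urlparse

parsed = urlparse('http://www.example.org/about.htm?p=8#contact')

print(parsed.path)       # /about.htm
print(parsed.query)      # p=8
print(parsed.fragment)   # contact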

     We’ve now seen the basics of the requests library. Take some time to explore the documentation of the library available at http://docs.python-requests.org/en/master/. The quality of requests’ documentation is very high and easy to refer to once you start using the library in your projects.
