MATLAB crawler and Chinese URLs: web crawling in MATLAB with urlread2

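Many readers run an older MATLAB release and therefore do not have the webread function (it first shipped in R2014b). The MathWorks File Exchange offers an extension of urlread called urlread2. It was written by Jim Hokanson using Java; see his article "Expanding urlread capabilities".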

【Introduction】

HTTP is the underlying computer networking protocol that enables us to read webpages on the Internet. It consists of a request made by the user to an Internet server (typically located via URL), and a response from that server. Importantly, the request and response consist of three main parts: a request line (for requests) or status line (for responses), followed by headers, and optionally a message body.

Matlab's built-in urlread function enables Matlab users to easily read the server's response text into a Matlab string:

text = urlread('http://www.google.com');

This is done internally using Java code that connects to the specified URL and reads the information sent by the URL's server.

urlread accepts optional additional inputs specifying the request type ('get' or 'post') and parameter values for the request.

Unfortunately, urlread has the following limitations:

1. It does not allow specification of request headers.
2. It makes assumptions as to the request headers needed based on the input method.
3. It does not expose the response headers and status line.
4. It assumes the response body contains text, and not a binary payload.
5. It does not enable uploading binary contents to the server.
6. It does not enable specifying a timeout in case the server is not responding.

urlread2

The urlread2 function addresses all of these problems. The overall design decision for this function was to make it more general: it requires more work up front in some cases, but offers more flexibility.

【Syntax】

For reference, the following is the calling format for urlread2 (which is reminiscent of urlread's):

urlread2(url,*method,*body,*headersIn,varargin)

The * marks optional inputs whose position in the argument list must be preserved.

url – (string) URL to request
method – (string, default GET) HTTP request method
body – (string, default '') body of the request
headersIn – (structure, default []) see the following section
varargin – extra properties, specified as property/value pairs

Addressing Problem 1 – Request header

urlread internally uses a Java object called urlConnection that is generally an instance of the class sun.net.www.protocol.http.HttpURLConnection. The method setRequestProperty() can be used to set headers for the request. This method has two inputs, the header name and the value of that header. A simple example of this can be seen below:

urlConnection.setRequestProperty('Content-Type','application/x-www-form-urlencoded');

Here 'Content-Type' is the header name and the second input is the value of that property. My function requires passing in nearly all headers as a structure array, with fields for the name and value. The preceding header would be created using a helper function http_createHeader.m:

header = http_createHeader('Content-Type','application/x-www-form-urlencoded');

Multiple headers can be passed in to the function by concatenating header structures into a structure array.
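As an illustration of the concatenation just described, the following minimal sketch builds a two-element header structure array by hand. It assumes that http_createHeader returns a structure with the fields name and value (the field names are an assumption based on the description above), and the Accept header is only an example.

```matlab
% Build two header structures by hand. The field names 'name' and 'value'
% are assumed to match what http_createHeader produces.
h1 = struct('name', 'Content-Type', 'value', 'application/x-www-form-urlencoded');
h2 = struct('name', 'Accept',       'value', 'text/html');

% Concatenate the structures into a 1x2 structure array.
headers = [h1 h2];

% List every header in the array.
for k = 1:numel(headers)
    fprintf('%s: %s\n', headers(k).name, headers(k).value);
end
```

With the real helper, each element would presumably come from a call to http_createHeader, and the resulting array would be passed as the headersIn input.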

Addressing Problem 2 – Request parameters

When making a POST request, parameters are generally specified in the message body using the following format:

[property]=[value]&[property]=[value]

The properties and values are also encoded in a particular way, generally termed urlencoded (encoding and decoding can be done using Matlab's built-in urlencode and urldecode functions). For GET requests this string is appended to the URL after a "?" symbol. Since urlencoding methods can vary, and in the spirit of reducing assumptions, I use separate functions to generate these strings outside of urlread2, and then pass the result in either as the url (for GET) or as the body input (for POST). As an example, I might search the MathWorks website using the upper right search bar on its site for "undocumented matlab" under file exchange (hmmm… pretty cute stuff there!). Doing this performs a GET request with the following property/value pairs:

params = {'search_submit','fileexchange', 'term','undocumented matlab', 'query','undocumented matlab'};

These property/value pairs are somewhat obvious from looking at the URL, but could also be determined by using programs such as Fiddler, Firebug, or HttpWatch.

After urlencoding and concatenating, we would form the following string:

search_submit=fileexchange&term=undocumented+matlab&query=undocumented+matlab

This functionality is normally accomplished internally in urlread, but I use a function http_paramsToString to produce that result. That function also returns the required header for POST requests. The following is an example of both GET and POST requests:

[queryString,header] = http_paramsToString(params,1);

% For GET:
url = [url '?' queryString];
urlread2(url)

% For POST:
urlread2(url,'POST',queryString,header)
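For readers who want to see the encoding step itself, here is a minimal sketch that builds the same query string using only Matlab's built-in urlencode. It is not the implementation of http_paramsToString, whose internals may differ; it only illustrates the [property]=[value]&[property]=[value] format described above.

```matlab
% Property/value pairs from the search example above.
params = {'search_submit','fileexchange', 'term','undocumented matlab', ...
          'query','undocumented matlab'};

names  = params(1:2:end);   % odd entries are the property names
values = params(2:2:end);   % even entries are the property values

% Encode each name and value, then join them as name=value.
pairs = cell(1, numel(names));
for k = 1:numel(names)
    pairs{k} = [urlencode(names{k}) '=' urlencode(values{k})];
end

% Join the pairs with '&' and drop the trailing separator.
queryString = sprintf('%s&', pairs{:});
queryString(end) = [];

disp(queryString)
```

urlencode replaces each space with "+", so the result should match the encoded string shown earlier in this section.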

Addressing Problem 3 – Response header

According to the HTTP protocol, each server response starts with a status line that indicates a numeric response status. The following Matlab code provides access to the status line using the urlConnection object:

status = struct('value',urlConnection.getResponseCode(), 'msg',char(urlConnection.getResponseMessage))

status =
    value: 200
    msg: 'OK'

urlConnection's getHeaderField() and getHeaderFieldKey() methods enable reading the specific parts of the response header:

headerValue = char(urlConnection.getHeaderField(headerIndex));
headerName = char(urlConnection.getHeaderFieldKey(headerIndex));

headerIndex starts at 0 and increases by 1 until both headerValue and headerName return empty.
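The indexing loop described above can be sketched as follows. This is a minimal illustration rather than the code inside urlread2: it opens its own connection through java.net.URL instead of reusing urlread's internal object, the target URL is only an example, and it needs network access to run.

```matlab
% Open a connection directly through Java (example URL, an assumption).
url = java.net.URL('http://www.mathworks.com');
urlConnection = url.openConnection();

% Walk the response headers by index until both name and value are empty.
headerIndex = 0;
while true
    headerName  = char(urlConnection.getHeaderFieldKey(headerIndex));
    headerValue = char(urlConnection.getHeaderField(headerIndex));
    if isempty(headerName) && isempty(headerValue)
        break;
    end
    fprintf('[%d] %s: %s\n', headerIndex, headerName, headerValue);
    headerIndex = headerIndex + 1;
end
```

For HTTP connections, index 0 typically carries the status line with an empty name, which is why the loop stops only when both strings are empty.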

It is important to note that header keys (names) can be repeated for different values. Sometimes this is desired, such as if there are multiple cookies being sent to the user. To generically handle this case, two header structures are returned. In both cases the header names are the field names in the structure, after replacing hyphens with underscores. In one case, allHeaders, the values are cell arrays of strings containing all values presented with the particular key. The other structure, firstHeaders, contains only the first instance of the header as a string to avoid needing to dereference a cell array.
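To make the two structures concrete, here is a small self-contained sketch that builds them from a made-up list of response headers. The header names and values are hypothetical, and the code only mirrors the behavior described above; it is not taken from urlread2.

```matlab
% Hypothetical response headers, as name/value lists in arrival order.
names  = {'Content-Type', 'Set-Cookie', 'Set-Cookie'};
values = {'text/html',    'a=1',        'b=2'};

allHeaders   = struct();
firstHeaders = struct();
for k = 1:numel(names)
    % Hyphens are not valid in field names, so replace them.
    field = strrep(names{k}, '-', '_');
    if isfield(allHeaders, field)
        % Repeated key: append the value to the existing cell array.
        allHeaders.(field){end+1} = values{k};
    else
        % First occurrence: start a cell array and record the string.
        allHeaders.(field)   = values(k);
        firstHeaders.(field) = values{k};
    end
end

disp(allHeaders.Set_Cookie)     % cell array holding both cookie values
disp(firstHeaders.Set_Cookie)   % only the first cookie value
```

A header that appears once can then be read directly as a string from firstHeaders, while allHeaders keeps every value of a repeated key.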

Addressing Problem 4 – Response body

urlread assumes text output. This is fine for most webpages, which use HTML and are therefore text-based. However, urlread fails when trying to download any non-text resource such as an image, a ZIP file, or a PDF document. I have added a flag in urlread2 called CAST_OUTPUT, which defaults to true, i.e. text response, just as urlread assumes. Using varargin, this flag can be set to false ({'CAST_OUTPUT',false}) to indicate a binary response.
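A binary download might then look like the sketch below. It assumes that urlread2 is on the Matlab path, that the flag is passed as a property/value pair after the four positional inputs, and that the first output holds the raw bytes; the URL and file name are hypothetical, and the numeric type of the output is not stated above, so the code guards against int8.

```matlab
% Hypothetical URL of a binary resource; replace with a real one.
url = 'http://www.example.com/logo.png';

% method, body and headersIn are positional, so they are passed
% explicitly before the property/value pair.
output = urlread2(url, 'GET', '', [], 'CAST_OUTPUT', false);

% Reinterpret signed bytes as unsigned, in case they come back as int8
% (an assumption about the return type).
bytes = output(:);
if isa(bytes, 'int8')
    bytes = typecast(bytes, 'uint8');
end

% Write the bytes to disk unchanged.
fid = fopen('logo.png', 'w');
fwrite(fid, bytes, 'uint8');
fclose(fid);
```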

Summary

urlread2's functionality has been expanded to also address other limitations of urlread: it enables binary inputs, better character-set handling of the output, redirection following, and read timeouts.

The modifications described above provide direct access to the key components of the HTTP request and response messages. Its more generic nature lets urlread2 focus on HTTP transmission, and leaves request formation and response interpretation up to the user. I think ultimately this approach is better than providing one-off modifications of the original urlread function to suit a particular need. urlread2 and supporting files can be found on the Matlab File Exchange.

