The basic usage of the urllib library

In the basic usage of the urllib library we covered:

the basic composition of urllib

using the urlopen method for the simplest html web crawling

constructing urllib headers to simulate browser operation

handling errors and exceptions

Besides this basic usage of the Request method, there are many advanced features that can be used more flexibly in crawler applications, such as:

HTTP POST requests, for submitting data to the server (e.g. user login)

using proxy IPs to avoid being blocked by anti-crawling measures

setting timeouts to improve crawler efficiency

parsing URLs

This article analyzes and explains these topics in detail.

POST requests

POST is one of the request methods of the HTTP protocol and a commonly used way to submit data to a server. This section first covers some preparations for a POST request, then gives an example with a detailed analysis of its use and the deeper concepts behind it.

POST request preparation

Since we need to submit information to the server, we need to know where to fill it in and in what format. With these questions in mind, let's read on.

Take submitting user login information (username and password) as an example. Different websites may require different things; Taobao's anti-crawling mechanism, for instance, is fairly complex and requires a lot of extra information. Here we take Douban as a relatively simple example; the goal is to figure out how POST is used, and the more complex cases will be shared later in the hands-on section.

Setting aside complex cases like Taobao: if we only consider the username and password, the preparation boils down to finding out what the name attributes of the username and password tags are. This can be done in either of the following two ways:

viewing the page elements with the browser's F12 developer tools

using the capture tool Fiddler

Without further ado, let's see how to find the name attributes.

1. Browser F12

Inspect the page elements layer by layer in the browser (Chrome is used here): the email/mobile number input tag has name="form_email", and the password input tag has name="form_password", as shown in the red boxes in the screenshot.


", but it should be noted that the name name of the two tags is not fixed, and the name name on it is only defined by the bean site, not all. Other websites may have different names, such as name= "username", name= "password" and so on. Therefore, you need to see what name is for login for different sites each time.

2. Via the Fiddler capture tool

(Fiddler screenshot: http://files.jb51.net/file_images/article/201801/201801051518002.png)

I recommend using the Fiddler tool; it is very easy to use. A crawler is essentially a simulated browser, and we just need to know how the browser submits the data.

Fiddler captures all the contents of the browser's POST request for us, so we take the information from the browser's POST, fill it into the crawler program, and the simulation of the browser operation is done. In addition, Fiddler also makes it convenient to capture the browser's headers.

For those installing Fiddler: watch out for the pitfall of the Fiddler certificate problem (being unable to capture HTTPS packets). It can be solved through Tools > Options > HTTPS by enabling "Decrypt HTTPS traffic" and installing/trusting the certificate; otherwise it will only ever show captured Tunnel packets...
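Besides F12 and Fiddler, you can also check the field names with a small script. Below is a purely illustrative sketch (not from the original article): it fetches the login page and prints the name attribute of every input tag it finds; the URL and the simplified User-Agent header are assumptions.

# Illustrative sketch: list the name attributes of <input> tags on a page,
# as an alternative to checking them by hand with F12 or Fiddler.
# The URL and the simplified User-Agent below are assumptions.
import urllib.request
from html.parser import HTMLParser

class InputNameParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attr_dict = dict(attrs)
            if 'name' in attr_dict:
                print(attr_dict.get('type', 'text'), '->', attr_dict['name'])

req = urllib.request.Request('https://www.douban.com/',
                             headers={'User-Agent': 'Mozilla/5.0'})
page = urllib.request.urlopen(req).read().decode('utf8')
InputNameParser().feed(page)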

Good, the preparation is done. Let's go straight to the code.

POST request code

# coding: utf-8
import urllib.request
import urllib.error
import urllib.parse

# headers information can be copied from Fiddler or from your browser
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.9',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36'}

# POST request information; fill in your own username and password
value = {'source': 'index_nav',
         'form_password': 'your password',
         'form_email': 'your username'}

try:
    data = urllib.parse.urlencode(value).encode('utf8')
    response = urllib.request.Request('https://www.douban.com/', data=data, headers=headers)
    html = urllib.request.urlopen(response)
    result = html.read().decode('utf8')
    print(result)
# catch HTTPError before URLError, since HTTPError is a subclass of URLError
except urllib.error.HTTPError as e:
    if hasattr(e, 'code'):
        print('Error code: ' + str(e.code))
except urllib.error.URLError as e:
    if hasattr(e, 'reason'):
        print('Error reason: ' + str(e.reason))
else:
    print('Request succeeded.')

Operation result:

<!DOCTYPE HTML>
<html lang="zh-cmn-Hans">
<head>
<meta charset="UTF-8">
<meta name="description" content="Provides recommendations, reviews and price comparisons for books, movies and music records, as well as the unique cultural life of cities.">
...
window.attachEvent('onload', _ga_init);
}
</script>
</body>
</html>

Note: when copying the headers from the browser, certain items have to be removed, otherwise an error will be prompted.

POST request code analysis

Let's analyze the code above. The basic usage of the urllib library and of Request is the same as in the earlier article on the basics of urllib crawling, so here we focus on the POST parameters and the parsing-related content.

data = urllib.parse.urlencode(value).encode('utf8')

This line uses the parse module of the urllib library to process the data to be posted.

Why do we need to parse it? Because the POST data has to be encoded into a format that can be sent, and the encoding rules must comply with the RFC standard. Baidu's definition of RFC, for reference:

Request For Comments (RFC) is a series of numbered documents. They collect information about the Internet, as well as software documents for UNIX and the Internet community. RFC documents are currently sponsored by the Internet Society (ISOC). The basic Internet communication protocols are all described in detail in RFC documents, which also cover many topics beyond the standards themselves, such as newly developed Internet protocols and all the records of their development, so almost all Internet standards are included in RFC documents.

The urlencode method of parse converts a dictionary or a sequence of two-element tuples into a URL query string (in other words, it transforms the format to comply with the RFC standard). The converted string is then encoded into binary (bytes) with UTF-8 so that it can be sent.

Note: the above is done in a Python 3.x environment. The encoding and decoding rules in Python 3.x are: string -> bytes: encode; bytes -> string: decode.
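As a small, purely illustrative sketch of what urlencode and encode/decode do (reusing the placeholder values from the code above):

# Illustrative sketch: what urlencode and encode/decode produce.
# The dictionary values are the placeholders used in the example above.
import urllib.parse

value = {'source': 'index_nav',
         'form_password': 'your password',
         'form_email': 'your username'}

query = urllib.parse.urlencode(value)
print(query)                          # source=index_nav&form_password=your+password&form_email=your+username
data = query.encode('utf8')           # string -> bytes (encode)
print(data)                           # b'source=index_nav&form_password=your+password&form_email=your+username'
print(data.decode('utf8') == query)   # bytes -> string (decode); prints True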

Proxy IP

Why use a proxy IP? Because various anti-crawling mechanisms detect how frequently the same IP crawls a site: if it crawls too fast, it will be identified as a robot and your IP will be blocked, while crawling too slowly hurts throughput. So we replace our own IP with a proxy IP; by continually switching to new IP addresses we can crawl pages quickly while reducing the chance of being detected as a robot.

We still use urllib's request module to work with a proxy IP, but unlike the urlopen used before, we need to create a customized opener ourselves. What does that mean?

urlopen is like a general-purpose opener. When we need special functionality (such as a proxy IP), urlopen cannot meet our needs, so we have to define and create an opener ourselves, using the handlers provided by request, which cover various functions:

 ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor, DataHandler

The one we want to use here is ProxyHandler, which handles the proxy.

Let's see how it is used in code.

# coding: utf-8
import urllib.request
import urllib.error
import urllib.parse

# headers information can be copied from Fiddler or from your browser
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.9',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36'}

# POST request information
value = {'source': 'index_nav',
         'form_password': 'your password',
         'form_email': 'your username'}

# proxy IP information as a dictionary: the key is 'http', the value is 'proxy IP:port'
proxy = {'http': '115.193.101.21:61234'}

try:
    data = urllib.parse.urlencode(value).encode('utf8')
    response = urllib.request.Request('https://www.douban.com/', data=data, headers=headers)
    # use the ProxyHandler method to generate a handler object for the proxy IP
    proxy_handler = urllib.request.ProxyHandler(proxy)
    # create an opener instance that uses the proxy handler
    opener = urllib.request.build_opener(proxy_handler)
    # open the Request (carrying the POST data and headers) through the opener
    html = opener.open(response)
    result = html.read().decode('utf8')
    print(result)
# catch HTTPError before URLError, since HTTPError is a subclass of URLError
except urllib.error.HTTPError as e:
    if hasattr(e, 'code'):
        print('Error code: ' + str(e.code))
except urllib.error.URLError as e:
    if hasattr(e, 'reason'):
        print('Error reason: ' + str(e.reason))
else:
    print('Request succeeded.')

Based on the POST request code above, we complete the proxy IP operation simply by replacing urlopen with our own opener. Proxy IPs can be found on various free proxy IP websites.
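As a minimal sketch of the "continually switching IP addresses" idea (the proxy addresses and the target URL below are made-up placeholders, not working values), you can pick a proxy at random for each request and optionally install the opener globally so that a plain urlopen call goes through it as well:

# Minimal sketch: rotate among a small hand-maintained pool of proxies.
# The proxy addresses and target URL are placeholders, not real working values.
import random
import urllib.request

proxy_pool = [
    {'http': '115.193.101.21:61234'},   # placeholder proxy
    {'http': '123.45.67.89:8080'},      # placeholder proxy
]

proxy = random.choice(proxy_pool)
proxy_handler = urllib.request.ProxyHandler(proxy)
opener = urllib.request.build_opener(proxy_handler)

# Option 1: use the opener directly
# html = opener.open('http://example.com/')

# Option 2: install it globally so that urllib.request.urlopen also goes through the proxy
urllib.request.install_opener(opener)
html = urllib.request.urlopen('http://example.com/')
print(html.getcode())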

That is all the content; thank you for your support of Script Home.

