In the previous article we covered the basics of the Python crawler framework Scrapy, such as installation and configuration. In this article we will look at how to use Scrapy to conveniently scrape a site's content, taking one site as the example.

A web crawler is a program that fetches data from the web; you can use it to grab the HTML of particular pages. Although you can build a crawler with a handful of libraries, using a framework greatly improves efficiency and shortens development time. Scrapy is written in Python, is lightweight and simple, and is very convenient to use. With Scrapy you can easily complete data collection on the web; it does a great deal of the work for us without requiring much effort of our own.

First, let's answer a question.
Q: How many steps does it take to put a website into a crawler?
The answer is simple, four steps:
New Project (Project): create a new crawler project
Define the target (Items): define the data you want to scrape
Make the crawler (Spider): write the crawler and start crawling pages
Store the content (Pipeline): design a pipeline to store the scraped content

OK, now that the basic process is settled, we can complete it step by step.

1. New Project (Project)
Hold down the Shift key, right-click in an empty directory, select "Open command window here", and enter the command:

code is as follows:

scrapy startproject tutorial

Here, tutorial is the project name.
You will see that a tutorial folder is created, with the following directory structure:

code is as follows:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Here is a brief description of the role of each file:
scrapy.cfg: the project's configuration file
tutorial/: the project's Python module; you will import your code from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where spiders are stored

2. Define the target (Item)
In Scrapy, Items are containers used to hold the scraped content; they behave a bit like a Python dict, but provide some extra protection against errors.
In general, an Item is created by subclassing scrapy.item.Item, and its attributes are defined with scrapy.item.Field objects (you can think of this as something like an ORM mapping).
Next, we build our item model.
First, the content we want:
name (name)
link (url)
description (description)

Modify items.py in the tutorial directory and add our own class after the original one.
Since we want to scrape the content of the dmoz site, we can name it DmozItem:

code is as follows:

# Define here the models for your scraped items
# See documentation in the Scrapy docs (topics/items)

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

At first this may look a little puzzling, but defining these items lets you know exactly what your items are when you work with the other components.
You can simply think of an Item as an encapsulated class object.
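To make the dict analogy concrete, here is a small sketch you could run in a Python shell from the project root (it assumes the DmozItem defined above); it shows both the dictionary-style access and the extra protection against typos in field names:

from tutorial.items import DmozItem

item = DmozItem(title='Example title')
print item['title']              # Example title -- standard dict-style access
item['desc'] = 'A description'   # declared fields can be assigned freely
try:
    item['author'] = 'nobody'    # 'author' was never declared as a Field
except KeyError:
    print 'undeclared fields raise KeyError instead of being silently created'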

3. Make the crawler (Spider)

Making the crawler takes two steps overall: first crawl, then extract.
That is, first you fetch the entire content of the page, and then take out the parts that are useful to you.
3.1 Crawling
A Spider is a class written by the user to scrape information from a domain (or group of domains).
It defines the list of URLs to download, how to follow links, and how to parse page content in order to extract items.
To build a Spider, you must create a subclass of scrapy.spiders.Spider (BaseSpider in older Scrapy versions) and define three mandatory attributes:
name: the identifying name of the crawler. It must be unique; you must define different names for different crawlers.
start_urls: the list of URLs to crawl. The crawler starts pulling data from here, so the first data downloaded will come from these URLs. Other child URLs are generated by following links from these starting URLs.
parse(): the parsing method. It is called with the Response object returned from each URL as its only argument; it is responsible for parsing the fetched data, matching it to items, and following further URLs.
Here you can refer to the breadth-first crawling ideas mentioned in this tutorial for help understanding it: [Java] Zhihu crawler part 5: using the HttpClient toolkit and a breadth-first crawler.
In other words, store the URLs, use them as starting points to spread out gradually, grab and store every eligible page URL, and keep crawling (see the short conceptual sketch below).
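The breadth-first idea itself is not Scrapy-specific; as a rough conceptual sketch in plain Python (the fetch_links helper is hypothetical, standing in for "download the page and return the links found on it"):

from collections import deque

def bfs_crawl(start_urls, fetch_links, max_pages=100):
    # Breadth-first crawl: visit the start URLs first, then every URL they
    # link to, and so on, never visiting the same URL twice.
    queue = deque(start_urls)
    seen = set(start_urls)
    while queue and len(seen) <= max_pages:
        url = queue.popleft()            # FIFO order gives the breadth-first behaviour
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen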

Let's write the first crawler, named dmoz_spider.py, and save it in the tutorial\spiders directory.
The code of dmoz_spider.py is as follows:

code is as follows:

from scrapy.spiders import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

allowed_domains is the search domain of the crawler, that is, the region the crawler is constrained to: it will only crawl pages under this domain.
From the parse function you can see that the second-to-last segment of each URL is taken as the file name for storage.
Then run it and see. Hold down Shift in the tutorial directory, right-click to open a command window here, and enter:

code is as follows:

scrapy crawl dmoz

The run results are as shown in the figure:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)
An error on the very first run of a Scrapy project; just our luck.
It should be an encoding issue; Googling turned up the solution:
Create a new sitecustomize.py in Python's Lib\site-packages folder:

code is as follows:

import sys
sys.setdefaultencoding('gb2312')

Run it again. OK, the problem is solved; let's look at the results:

The last line, INFO: Closing spider (finished), shows that the crawler ran successfully and shut down on its own.
The lines containing [dmoz] correspond to the output of our crawler.
You can see that there is a log line for every URL defined in start_urls.
Do you remember our start_urls?
Because these URLs are starting pages, they have no referrers, so at the end of each of their log lines you will see (referer: <None>).
Under the action of the parse method, two files are created, Books and Resources, containing the page content of the two URLs.

So what exactly happened amid all that thunder and lightning just now?
First, Scrapy creates a scrapy.http.Request object for every URL in the crawler's start_urls attribute, and designates the crawler's parse method as the callback function.
These Requests are then scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the crawler through the parse() method.
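Put differently, the default behaviour is roughly equivalent to writing the Requests by hand. Here is a minimal sketch of what Scrapy does implicitly when only start_urls is set (start_requests and scrapy.http.Request are standard Scrapy APIs; the URLs are the tutorial ones used above):

from scrapy.spiders import Spider
from scrapy.http import Request

class DmozSpider(Spider):
    name = "dmoz"

    def start_requests(self):
        # What Scrapy does for every URL in start_urls:
        # wrap it in a Request whose callback is self.parse.
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # The scheduled Request comes back here as a Response object.
        self.log("Visited %s" % response.url)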

3.2 Extracting

Having crawled the whole page, the next step is the extraction process.
Just storing an entire HTML page is not enough.
In a basic crawler, this step could be done with regular expressions.
In Scrapy, a mechanism called XPath selectors is used instead, which is based on XPath expressions.
If you want to learn more about selectors and other mechanisms, you can refer to the documentation.

Here are some examples of XPath expressions and their meanings:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text content of the aforementioned <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all <div> elements that have a class="mine" attribute
These are just a few simple examples of XPath use; in fact XPath is very powerful (a short runnable sketch follows).
You can refer to the W3C tutorial for more.
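To try those exact expressions without launching a crawl, you can feed a small HTML snippet to Scrapy's Selector directly; a sketch, assuming your Scrapy version accepts Selector(text=...):

from scrapy.selector import Selector

html = """
<html>
 <head><title>Demo page</title></head>
 <body>
  <table><tr><td>cell 1</td><td>cell 2</td></tr></table>
  <div class="mine">my div</div>
  <div class="other">another div</div>
 </body>
</html>
"""

sel = Selector(text=html)
print sel.xpath('/html/head/title').extract()             # [u'<title>Demo page</title>']
print sel.xpath('/html/head/title/text()').extract()      # [u'Demo page']
print sel.xpath('//td/text()').extract()                  # [u'cell 1', u'cell 2']
print sel.xpath('//div[@class="mine"]/text()').extract()  # [u'my div']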

To make XPath easier to use, Scrapy provides the XPathSelector class, in two flavours: HtmlXPathSelector (for parsing HTML data) and XmlXPathSelector (for parsing XML data).
They must be instantiated from a Response object.
You will find that Selector objects expose the node structure of the document, so the first selector you instantiate corresponds to the root node, that is, the whole document.
Inside Scrapy, Selectors have four basic methods (see the API documentation):
xpath(): returns a list of selectors, each representing a node selected by the xpath expression given as the parameter
css(): returns a list of selectors, each representing a node selected by the CSS expression given as the parameter
extract(): returns a unicode string with the selected data
re(): returns a list of unicode strings, extracted by applying the regular expression given as the parameter
(A small css() sketch follows this list; the shell session in the next section exercises the other three methods.)
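As promised, a quick css() sketch. These are what the css() equivalents of the later xpath queries would look like, assuming a Scrapy version whose css() supports the ::text and ::attr() extensions:

# Equivalent queries written with css() instead of xpath():
sel.css('title::text').extract()          # text of the <title> element
sel.css('ul li a::text').extract()        # anchor text of every <li><a> inside a <ul>
sel.css('ul li a::attr(href)').extract()  # the href attribute of those anchors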

3.3 XPath experiments
Let's try out the Selector in the Shell.
The experiment site is the dmoz Books page we used in start_urls.

Now that we know our lab mouse, the next step is to use the Shell to scrape the page.
Enter the top-level directory of the project, that is, the first tutorial folder, and enter in CMD:

code is as follows:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

After pressing Enter you will see the following:

After the Shell loads, you will receive the response, stored in the local variable response.
So if you enter response.body, you will see the body of the response, that is, the content of the fetched page:


Or enter response.headers to view its header section:

Now it is as if we are holding a handful of sand that hides the gold we want, so the next step is to shake it through a sieve a couple of times, throw out the impurities, and pick out the key content.
The selector is just such a sieve.
In older versions, the Shell instantiated two selectors: an hxs variable for parsing HTML and an xxs variable for parsing XML.
The Shell we use now has the selector sel prepared for us already; it automatically chooses the best parsing scheme (XML or HTML) based on the type of the returned data.
Then let's crunch some data!
To understand this thoroughly, first you need to know what the fetched page actually looks like.
For example, suppose we want to grab the title of the page, that is, the <title> tag:

You can enter:

code is as follows:

sel.xpath('//title')


This pulls the tag out, and with extract() and text() you can process it further.
Note: here are some useful, simple XPath path expressions:
Expression   Description
nodename     Selects all child nodes of the named node.
/            Selects from the root node.
//           Selects matching nodes anywhere in the document, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.
All the results are shown below; In [i] is the input of the i-th experiment and Out [i] is its output (recommended reference: the W3C tutorial):

code is as follows:

In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]
In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]
In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Of course, this title tag is not of much value to us; let's grab something truly meaningful.
Using Firefox's Inspect Element we can see clearly that what we need looks like this:

We can use the following code to grab the <li> tags:

code is as follows:

sel.xpath('//ul/li')

From the <li> tags we can obtain the descriptions of the sites:

code is as follows:

sel.xpath('//ul/li/text()').extract()

The titles of the sites can be obtained like this:

code is as follows:

sel.xpath('//ul/li/a/text()').extract()

And the hyperlinks of the sites like this:

code is as follows:

sel.xpath('//ul/li/a/@href').extract()

Of course, the preceding examples access attributes directly.
Note that xpath returns a list of selector objects,
so we can call xpath on the objects in that list directly to dig into deeper nodes
(see: Nesting selectors and Working with relative XPaths in the selectors documentation):
sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

3.4 XPath in practice
We have practised with the shell for quite a while; finally we can apply what we have learned to the dmoz_spider crawler.
Make the following changes to the parse function of the original crawler:

code is as follows:

from scrapy.spiders import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title


Note that we import Selector from scrapy.selector and instantiate a new Selector object. This lets us operate on the response with xpath just as in the Shell.
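As an aside, more recent Scrapy versions expose the selector directly on the response object, so the explicit Selector(response) step can be skipped; a sketch of the shorter form of the same parse method:

    def parse(self, response):
        # response.xpath() builds the selector for you in newer Scrapy versions
        for site in response.xpath('//ul/li'):
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title, link, desc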
Let's try running the crawler by entering the command in the tutorial root directory:

code is as follows:

scrapy crawl dmoz

The run results are as follows:

Sure enough, all of the titles were captured. But something is off: why were Top, Python and the rest of the navigation bar captured as well?
We only need the content inside the red circle:

It seems our xpath statement has a problem: it captured not only the project names we wanted but also some innocent elements that happen to match the same xpath syntax.
Inspecting the elements, we find that the <ul> we need has the attribute class="directory-url",
so the xpath statement only needs to become sel.xpath('//ul[@class="directory-url"]/li').
Adjust the xpath statement as follows:

code is as follows:

from scrapy.spiders import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

This time all the headlines are captured successfully, and absolutely no innocents are caught up:

3.5 Using the Item

Let's take a look at how to use the Item.
As we said earlier, Item objects are custom Python dictionaries; you can use standard dictionary syntax to get the value of a field:

code is as follows:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

As the crawler, the Spider wants to store the scraped data in Item objects. To return the data we scraped, the spider's final code should look like this:

code is as follows:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
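If you prefer, the same parse method can also be written as a generator: instead of collecting the items in a list and returning it, each item is yielded as soon as it is built, and Scrapy treats both forms the same way. A small equivalent sketch of just the parse method:

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath('//ul[@class="directory-url"]/li'):
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            yield item   # handed to Scrapy one at a time instead of as a list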

4. Store the content (Pipeline)
The simplest way to save the scraped information is with Feed exports, which mainly come in four formats: JSON, JSON lines, CSV and XML.
Here we export the results as JSON, the most commonly used format; the command is as follows:

code is as follows:

scrapy crawl dmoz -o items.json -t json

-o is followed by the name of the export file, and -t by the export format.
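The same -o/-t pair works for the other three Feed export formats mentioned above; for example (the output file names here are just placeholders):

scrapy crawl dmoz -o items.jl -t jsonlines
scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.xml -t xml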
Then take a look at the exported results by opening the json file in a text editor (for ease of display, all attributes except title have been removed from the items):

Because this is just a small example, such simple processing is enough.
If you want to do more complicated things with the scraped items, you can write an Item Pipeline.
That's something for us to play with slowly later on ^_^
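To give a first flavour before then, here is a minimal sketch of what such a pipeline might look like: a hypothetical TutorialPipeline in tutorial/pipelines.py that writes each item to a JSON-lines file, which would also need to be enabled via ITEM_PIPELINES in settings.py:

# tutorial/pipelines.py -- illustrative sketch only
import json

class TutorialPipeline(object):
    def open_spider(self, spider):
        # called when the spider starts
        self.file = open('items.jl', 'wb')

    def close_spider(self, spider):
        # called when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider returns
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

# and in settings.py (the priority number 300 is arbitrary):
# ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}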

That is the whole process of using the Python crawler framework Scrapy to crawl a site's content, in full detail. I hope it helps; if you need anything, feel free to get in touch, and we can make progress together.

