Goal and task: crawl Tencent's recruitment listings. For each posting we need to scrape the position name, the link to the job details, the job category, the number of openings, the work location, and the publish date.


1. Create a Scrapy project

    scrapy startproject Tencent

Running this command creates a Tencent folder containing the project skeleton.

2. Write the item file, defining one field for each piece of content to be scraped

    # -*- coding: utf-8 -*-
    import scrapy


    class TencentItem(scrapy.Item):
        # position name
        positionname = scrapy.Field()
        # link to the job details
        positionlink = scrapy.Field()
        # job category
        positionType = scrapy.Field()
        # number of openings
        peopleNum = scrapy.Field()
        # work location
        workLocation = scrapy.Field()
        # publish date
        publishTime = scrapy.Field()



3. Write the spider file

Change into the Tencent directory and create a spider with the following command:

    # tencentPosition is the spider name, tencent.com is the crawl scope
    scrapy genspider tencentPosition "tencent.com"

The command creates a tencentPosition.py file in the spiders folder. Now fill it in:

    # -*- coding: utf-8 -*-
    import scrapy
    from Tencent.items import TencentItem


    class TencentpositionSpider(scrapy.Spider):
        """Crawl Tencent recruitment listings."""
        # spider name
        name = "tencentPosition"
        # crawl scope
        allowed_domains = ["tencent.com"]

        url = "http://hr.tencent.com/position.php?&start="
        offset = 0
        # initial URL
        start_urls = [url + str(offset)]

        def parse(self, response):
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                # initialize the item object
                item = TencentItem()
                # position name
                item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
                # link to the job details
                item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
                # job category
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
                # number of openings
                item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
                # work location
                item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
                # publish date
                item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]

                yield item

            if self.offset < 1680:
                # After each page is processed, request the next page:
                # increment self.offset by 10, append it to the base URL,
                # and handle the response with the callback self.parse
                self.offset += 10
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
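The paging logic in parse can be checked without running a crawl. Below is a minimal sketch in plain Python, assuming the same base URL, step of 10, and upper bound of 1680 as the spider; the helper name page_urls is hypothetical and not part of Scrapy.

```python
def page_urls(base_url, step=10, last_offset=1680):
    """Yield the listing URLs in the order the spider would request them."""
    offset = 0
    # the first page corresponds to start_urls
    yield base_url + str(offset)
    # mirror the spider: keep advancing while offset < last_offset
    while offset < last_offset:
        offset += step
        yield base_url + str(offset)


urls = list(page_urls("http://hr.tencent.com/position.php?&start="))
print(len(urls))
print(urls[0])
print(urls[-1])
```

With these values the listing is walked in 169 requests, from start=0 to start=1680.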


4. Write the pipelines file

    # -*- coding: utf-8 -*-
    import json


    class TencentPipeline(object):
        """Save the item data."""

        def __init__(self):
            # text mode with UTF-8, so non-ASCII characters are written as-is
            self.filename = open("tencent.json", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # one JSON object per line
            text = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(text)
            return item

        def close_spider(self, spider):
            self.filename.close()
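The effect of ensure_ascii=False is easiest to see on a record that contains Chinese text. Here is a minimal sketch using only the standard library; the sample record and the temporary file are made up for illustration, and only the json.dumps call with ensure_ascii=False is taken from the pipeline.

```python
import json
import os
import tempfile

# hypothetical sample record; the field names match TencentItem
record = {"positionname": "后台开发工程师", "peopleNum": "2"}

# default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(record)
# ensure_ascii=False keeps the characters readable in the output file
readable = json.dumps(record, ensure_ascii=False)

# write one JSON object per line, as the pipeline does
path = os.path.join(tempfile.mkdtemp(), "tencent.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(readable + "\n")

# read the file back line by line
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(escaped)
print(readable)
print(loaded[0] == record)
```

Both forms decode to the same data; the difference is only whether the saved file is readable by eye.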


5. Configure the settings file

    # set the default request headers, adding a User-Agent
    DEFAULT_REQUEST_HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    # enable the item pipeline
    ITEM_PIPELINES = {
        'Tencent.pipelines.TencentPipeline': 300,
    }
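The value 300 is the pipeline's order: Scrapy runs the enabled pipelines from the lowest value to the highest, and the values are customarily chosen in the 0-1000 range. Below is a minimal sketch of that ordering in plain Python. DedupPipeline is a made-up second pipeline added only for comparison, and the sort merely stands in for what Scrapy does internally.

```python
# hypothetical setting: TencentPipeline plus an invented second pipeline
ITEM_PIPELINES = {
    "Tencent.pipelines.TencentPipeline": 300,
    "Tencent.pipelines.DedupPipeline": 100,  # hypothetical, for illustration only
}

# lower values run first
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in order:
    print(ITEM_PIPELINES[path], path)
```

With a single pipeline, as in this project, the exact number does not matter.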

Execute the following command to run the program:

    # tencentPosition is the spider name
    scrapy crawl tencentPosition

Rewriting the spider with the CrawlSpider class

    # create the project
    scrapy startproject TencentSpider

    # change into the project directory and create the spider file
    scrapy genspider -t crawl Tencent tencent.com

The item file is written the same way as before. The main change is in the spider:

    # -*- coding: utf-8 -*-
    import scrapy
    # import the CrawlSpider class and Rule
    from scrapy.spiders import CrawlSpider, Rule
    # import the link extractor, used to pull out the links that match a rule
    from scrapy.linkextractors import LinkExtractor
    from TencentSpider.items import TencentItem


    class TencentSpider(CrawlSpider):
        name = "Tencent"
        allowed_domains = ["hr.tencent.com"]
        start_urls = ["http://hr.tencent.com/position.php?&start=0#a"]

        # link extraction rule applied to each Response;
        # returns the list of link objects that match
        pagelink = LinkExtractor(allow=(r"start=\d+"))

        rules = [
            # request each extracted link in turn, keep following,
            # and call the specified callback
            Rule(pagelink, callback="parseTencent", follow=True)
        ]

        # the specified callback
        def parseTencent(self, response):
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                item = TencentItem()
                # position name
                item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
                # link to the job details
                item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
                # job category
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
                # number of openings
                item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
                # work location
                item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
                # publish date
                item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]

                yield item
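The allow argument of LinkExtractor is a regular expression that is searched for in each link's URL. Below is a minimal sketch with Python's re module showing which links a pattern such as start=\d+ would select. The sample hrefs are invented, and the sketch only approximates LinkExtractor, which also resolves relative URLs and drops duplicates.

```python
import re

# the page-link pattern: "start=" followed by one or more digits
pattern = re.compile(r"start=\d+")

# hypothetical hrefs of the kind found on a listing page
hrefs = [
    "position.php?&start=10#a",
    "position.php?&start=20#a",
    "position_detail.php?id=1234",
    "about.php",
]

# search() matches anywhere in the URL, not only at the beginning
page_links = [h for h in hrefs if pattern.search(h)]
print(page_links)

# without the backslash, "d+" means one or more literal letters d
no_backslash = [h for h in hrefs if re.search("start=d+", h)]
print(no_backslash)
```

The backslash matters: written as start=d+, the pattern selects none of the pagination links, so the crawl would stop at the first page.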

That concludes the editor's introduction to example code for the Python crawler framework Scrapy. I hope it helps. If you have any questions, please leave a message and the editor will reply promptly.

Permanent link to this article: http://www.script-home.com/python-crawler-framework-scrapy-example-code.html | Script Home
