Python Web Scraping: Scrapy CrawlSpider Explained, Using the Sunshine Platform as an Example

Technology 2022-07-11

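This post crawls the Sunshine platform (wz.sun0769.com) in two ways: first with a plain Scrapy Spider, then with a Scrapy CrawlSpider. The second approach is shorter and more convenient. Comparing the two shows where CrawlSpider fits best: cases where the spider has to find next-page URLs or detail-page URLs on a listing page.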

1 Crawling the Sunshine platform with Scrapy Spider

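The earlier steps (creating the project and configuring settings, pipelines, items, and so on) are not repeated here; see the previous posts. The file below shows only the spider logic.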

import scrapy
from yangguang.yangguan.items import YangguanItem

'''
Goal: scrape from the Sunshine platform each event's title (title),
link (href), and content (content).
'''


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1']

    def parse(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            item = YangguanItem()  # initialize the item
            item['title'] = li.xpath('./span[3]/a/text()').extract_first()  # title
            href = li.xpath('./span[3]/a/@href').extract_first()
            if href is None:
                continue  # skip rows without a link
            # urljoin resolves the relative href against the page URL
            item['href'] = response.urljoin(href)  # detail page URL
            yield scrapy.Request(url=item['href'], callback=self.parse_detail, meta={'item': item})

        # Next page: check for None before building the URL, otherwise
        # string concatenation fails on the last page
        next_href = response.xpath("//div[@class='mr-three paging-box']/a[2]/@href").extract_first()
        if next_href is not None:
            yield scrapy.Request(url=response.urljoin(next_href), callback=self.parse)

    def parse_detail(self, response):
        item = response.meta.get('item')
        item['content'] = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        yield item

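Assessment: a large share of the time and code goes into finding the next-page URL and the detail-page URLs. Can this be simpler? It can, with Scrapy CrawlSpider. The idea:
1. Extract from the response the URLs that belong to all the li tags.
2. Automatically build the Request objects and send them to the engine.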
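The two steps above can be illustrated with the standard library alone. This is only a sketch of the idea, not how Scrapy implements it: the names HrefCollector and extract_matching_urls and the sample HTML are made up for illustration, and the real page markup will differ.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_matching_urls(html, base_url, allow_pattern):
    """Return absolute URLs found in `html` that match `allow_pattern`, without duplicates."""
    collector = HrefCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if re.search(allow_pattern, url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical list-page fragment, for illustration only.
SAMPLE_HTML = """
<ul>
  <li><a href="/political/politics/index?id=101">Event A</a></li>
  <li><a href="/political/politics/index?id=102">Event B</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
<a href="/political/index/politicsNewest?id=2">next</a>
"""

detail_urls = extract_matching_urls(
    SAMPLE_HTML,
    "http://wz.sun0769.com/political/index/politicsNewest?id=1",
    r"/political/politics/index\?id=\d+",
)
print(detail_urls)
```

In a real spider each URL in detail_urls would become a Request; CrawlSpider does both the extraction and the request building for you.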

2 Scrapy CrawlSpider

2.1 Overview

Command to generate a CrawlSpider: scrapy genspider -t crawl <spider name> <domain>

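2.1.1 LinkExtractors (link extractors)

With LinkExtractors, the programmer no longer has to pull out the wanted URLs by hand and then send the requests. That work is handed to the LinkExtractor, which finds every URL matching the rules on each crawled page, so the crawl proceeds automatically.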

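class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a', 'area'),
    attrs = ('href',),
    canonicalize = False,
    unique = True,
    process_value = None
)

'''
Main parameters:
allow: allowed URLs. Every URL that matches this regular expression is extracted.
deny: denied URLs. No URL that matches this regular expression is extracted.
allow_domains: allowed domains. Only URLs in the domains listed here are extracted.
deny_domains: denied domains. URLs in the domains listed here are never extracted.
restrict_xpaths: XPath expressions that limit the regions of the page in which links are looked for. Works together with allow to filter links.
'''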

2.1.2 The Rule class

Defines the crawling rules of the spider.

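class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)

'''
Main parameters:
● link_extractor: a LinkExtractor object that defines which links to extract.
● callback: the callback to run for each URL that satisfies this rule. CrawlSpider uses parse internally to implement its own logic, so do not override parse or use it as your callback.
● follow: whether links extracted from the response under this rule should themselves be followed.
● process_links: the links obtained from link_extractor are passed to this function, which can filter out links that should not be crawled.
'''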

2.2 Crawling the Sunshine platform with Scrapy CrawlSpider

Again only the spider file is shown; the other steps are the same as before.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

'''
1. Create the CrawlSpider with: scrapy genspider -t crawl <spider name> <domain>
2. When a CrawlSpider needs a callback, do not name the function parse
3. For each Rule, decide whether it needs follow, a callback, or both
'''


class CygSpider(CrawlSpider):
    name = 'cyg'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1']

    # Rules for extracting URLs
    rules = (
        # LinkExtractor: describes which URLs to extract
        # callback: the response for each extracted URL is passed to this callback
        # follow=True: keep extracting new URLs from the responses of matched pages
        # List pages (the question mark is escaped so it is matched literally)
        Rule(LinkExtractor(allow=r'http://wz.sun0769.com/political/index/politicsNewest\?id=\d+'), follow=True),
        # Detail pages (the question mark is escaped so it is matched literally)
        Rule(LinkExtractor(allow=r'http://wz.sun0769.com/political/politics/index\?id=\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        # Data from the detail page
        item['content'] = response.xpath("//div[@class='details-box']/pre/text()").extract_first()
        print(item)
        return item
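The allow argument is a regular expression, which is why the question mark in the rules is escaped. A small check with the standard re module shows what happens otherwise. The variable names are illustrative only, and "wzxsun0769.com" is a made-up host used to show the effect of an unescaped dot.

```python
import re

url = "http://wz.sun0769.com/political/politics/index?id=123"

escaped = r"http://wz.sun0769.com/political/politics/index\?id=\d+"
unescaped = r"http://wz.sun0769.com/political/politics/index?id=\d+"

# "\?" matches a literal question mark, so the real URL matches
matches_escaped = re.search(escaped, url) is not None

# "x?" means "an optional x", so the pattern expects "indexid=" or "indeid="
# and never matches the real URL
matches_unescaped = re.search(unescaped, url) is not None

# An unescaped "." matches any character, so a loose pattern also accepts
# a different (made-up) host; escaping the dots rejects it
loose_host = re.search(r"wz.sun0769.com", "http://wzxsun0769.com/") is not None
strict_host = re.search(r"wz\.sun0769\.com", "http://wzxsun0769.com/") is not None

print(matches_escaped, matches_unescaped, loose_host, strict_host)
```

An unescaped dot rarely causes trouble when allowed_domains is set, but an unescaped question mark silently makes a rule match nothing.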

3 Scraping mini-program names

Again only the spider file is shown.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from WechatAPPSpider.items import WechatappspiderItem

'''
Goal: extract each WeChat mini-program's name, author, and date.
Approach: find the detail pages from the list pages, then read the
fields on each detail page.
'''


class WxappSpider(CrawlSpider):
    name = 'wxapp'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    # List pages
    # http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3
    # http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=4
    # Detail pages
    # http://www.wxapp-union.com/article-6101-1.html
    # http://www.wxapp-union.com/article-6102-1.html
    rules = (
        # List pages
        Rule(LinkExtractor(allow=r'http://www.wxapp-union.com/portal.php\?mod=list&catid=2&page=\d+'), follow=True),
        # Detail pages
        Rule(LinkExtractor(allow=r'http://www.wxapp-union.com/article-\d+-1.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        # item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        item = WechatappspiderItem()
        item['title'] = response.xpath("//h1[@class='ph']/text()").extract_first()
        item['author'] = response.xpath("//p[@class='authors']/a/text()").extract_first()
        item['pub_data'] = response.xpath("//p[@class='authors']/span/text()").extract_first()
        # A single yield is enough; the extra "return item" after it was redundant
        yield item
