Scrapy for Python Crawlers, Part 5: Downloader Middleware and Randomizing the Header and IP


This post covers the downloader middleware in the Scrapy framework; the corresponding code lives in middlewares.py (class GithubDownloaderMiddleware). By modifying the middleware we can attach a random request header and a random proxy IP to every request. We first walk through the downloader middleware code, then explain how to randomize the header and the IP.

1 Downloader middleware

Below is the downloader middleware code that Scrapy generates in middlewares.py:

from scrapy import signals


class GithubDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

We will focus mainly on process_request(self, request, spider) and process_response(self, request, response, spider).

process_request(self, request, spider): called for every request that passes through the downloader middleware.

process_response(self, request, response, spider): called when the downloader has finished the HTTP request and is passing the response back to the engine.

process_exception(self, request, exception, spider): called when a download handler or a process_request() of another downloader middleware raises an exception.

1.1 process_request(self, request, spider)

This method is called for every Request object that passes through the downloader middleware; the smaller the order value a middleware has in DOWNLOADER_MIDDLEWARES, the earlier its process_request() is called. It must return one of the following: None, a Response object, a Request object, or raise an IgnoreRequest exception. The process_request code is:

def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    return None

Return None: Scrapy continues processing the request and calls the corresponding methods of the remaining middlewares;

Return a Response object: Scrapy does not call the process_request() methods of the remaining middlewares and does not start a download; the returned Response object is used directly;

Return a Request object: Scrapy does not call the process_request() methods of the remaining middlewares; the returned request is handed to the scheduler to be downloaded later;

If this method raises an exception, the process_exception() methods are called.
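To make these return values concrete, here is a minimal sketch of a custom process_request (the class name BlockListMiddleware and the '/blocked/' URL check are made up for illustration and are not part of the project above):

from scrapy.http import HtmlResponse


class BlockListMiddleware(object):
    def process_request(self, request, spider):
        if '/blocked/' in request.url:
            # Returning a Response short-circuits the chain: the remaining
            # process_request() methods are skipped, no download happens,
            # and the spider receives this Response directly.
            return HtmlResponse(url=request.url, body=b'blocked', encoding='utf-8')
        # Returning None lets the other middlewares run and the request
        # be downloaded as usual.
        return None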

1.2 process_response(self, request, response, spider)

This method is called for every Response that passes through the downloader middleware; the ordering is the reverse of process_request(): the middleware with the larger order value has its process_response() called first. It must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.

Return a Response object: Scrapy continues calling the process_response() methods of the remaining middlewares;

Return a Request object: the middleware chain stops and the request is handed to the scheduler to be downloaded later;

Raise an IgnoreRequest exception: the errback of the request (Request.errback) is called to handle it; if nothing handles it, the request is ignored and not logged.
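As a rough sketch of the Request case (the class name RetryBadStatusMiddleware and the status codes are illustrative, not part of the project above), a process_response that re-schedules requests which received an unwanted status might look like this:

class RetryBadStatusMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status in (403, 429):
            # Returning a Request stops the process_response() chain and
            # hands the request back to the scheduler to be downloaded again.
            retry_req = request.copy()
            retry_req.dont_filter = True  # let it past the duplicate filter
            return retry_req
        # Returning a Response passes it on to the remaining middlewares
        # and finally to the spider.
        return response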

2 Setting a random header and IP

When a crawler hits the same page frequently with exactly the same request, the server can easily detect it and block that request header. We therefore want to change the request header randomly before each visit, so the crawler is less likely to be banned. This can be done in the downloader middleware: just before a request is sent to the server, pick a request header at random, so the same header is not used every time. The same idea applies to the IP address.

2.1 Randomizing the request header

A URL for testing the request header: http://httpbin.org/user-agent. A site with a large collection of user-agent strings: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser. The steps are as follows: (1) Add the following code to the middlewares file

# Middleware that picks a random request header
import random


class RandomUserAgent(object):
    def process_request(self, request, spider):
        # print(request)      # print the request object
        # print('==========')
        # print(spider)       # print the spider object
        # pick a random user agent from the pool defined in settings
        user_agent = random.choice(spider.settings['USER_AGENTS'])
        # the key must be 'User-Agent' so the header actually sent is replaced
        request.headers['User-Agent'] = user_agent
        # print(user_agent)


# Middleware that reads the request header back when the response arrives
class CheckUserAgent(object):
    def process_response(self, request, response, spider):
        print(request.headers['User-Agent'])
        return response

(2) In the settings file. 1) Add the pool of request headers for the middlewares file to choose from

USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]

2) Enable the middlewares in settings

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'ua.middlewares.UaDownloaderMiddleware': 543,
    'ua.middlewares.RandomUserAgent': 544,
    'ua.middlewares.CheckUserAgent': 545,
}

(3) Spider code

import scrapy


class UseragentSpider(scrapy.Spider):
    name = 'useragent'
    # http://httpbin.org/user-agent echoes back the user agent it received
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # print(response.text)
        pass
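With these pieces in place you can verify the behaviour by uncommenting print(response.text) and running scrapy crawl useragent from the project directory; the CheckUserAgent middleware should print a User-Agent picked at random from USER_AGENTS, and httpbin should echo the same value back.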

2.2 Randomizing the IP

This works the same way as randomizing the request header. (1) Middleware code

# Middleware that picks a random proxy IP
import random


class RandomIp(object):
    def process_request(self, request, spider):
        # PROXY in settings holds entries of the form 'http://xxxx:port'
        proxy = random.choice(spider.settings['PROXY'])
        request.meta['proxy'] = proxy
        print(proxy)


# Middleware that reads the proxy back when the response arrives
class CheckIp(object):
    def process_response(self, request, response, spider):
        print(request.meta['proxy'])
        return response

(2) In the settings file. 1) Configure the proxy list

PROXY = [
    'http://xxxx:port1',
    'http://xxxx:port2',
    'http://xxxx:port3',
]  # fill in your own proxies

2) Set the middleware order values

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'ua.middlewares.UaDownloaderMiddleware': 543,
    'ua.middlewares.RandomUserAgent': 544,
    'ua.middlewares.CheckUserAgent': 545,
    'ua.middlewares.RandomIp': 546,
    'ua.middlewares.CheckIp': 547,
}

(3) Spider code

import scrapy


class UseragentSpider(scrapy.Spider):
    name = 'useragent'
    # start_urls = ['http://httpbin.org/user-agent']
    # http://httpbin.org/ip echoes back the IP address the request came from
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # print(response.text)
        pass
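As before, you can check the result by uncommenting print(response.text) and running scrapy crawl useragent; provided the proxies listed in PROXY are live, http://httpbin.org/ip should report the proxy's address rather than your own.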