Work Report on a Public-Opinion Prediction System During the Epidemic, Part 2

Technology, 2022-07-10

Project repository:

https://github.com/Lee991211/Innovation_training.git

The Scrapy module

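Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.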

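Scrapy's overall architecture is roughly as shown in the diagram (not reproduced here). To crawl the data a project needs, we only have to define that data and how to extract it ourselves, which is very convenient.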

Building the crawler

P.S. If you need to program with Scrapy, I recommend the Chinese edition of the Scrapy documentation, which explains things in detail.

https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Initializing the project

scrapy startproject weibo

This command creates a weibo directory with the following contents:

weibo/
    scrapy.cfg
    weibo/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            …

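These files are:

scrapy.cfg: the project's configuration file.
weibo/: the project's Python module; this is where you will add code.
weibo/items.py: the project's item definitions.
weibo/pipelines.py: the project's pipelines.
weibo/settings.py: the project's settings.
weibo/spiders/: the directory that holds the spider code.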

Defining the data to scrape

Looking at Weibo's data, we need each post's id, body text, time, location, and so on, so we define the corresponding fields in items.py:

import scrapy


class WeiboItem(scrapy.Item):
    # Fields collected for each Weibo post
    id = scrapy.Field()
    bid = scrapy.Field()
    user_id = scrapy.Field()
    screen_name = scrapy.Field()
    text = scrapy.Field()
    article_url = scrapy.Field()
    location = scrapy.Field()
    at_users = scrapy.Field()
    topics = scrapy.Field()
    reposts_count = scrapy.Field()
    comments_count = scrapy.Field()
    attitudes_count = scrapy.Field()
    created_at = scrapy.Field()
    source = scrapy.Field()
    pics = scrapy.Field()
    video_url = scrapy.Field()
    retweet_id = scrapy.Field()

Writing the spider

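A Spider is a class you write to scrape data from a single website (or a group of websites). It holds the initial URLs to download, the rules for following links in a page, and the methods that analyze page content and produce items. To create a Spider, you subclass scrapy.Spider and define the following three attributes:

name: identifies the Spider. It must be unique; two different Spiders cannot share a name.
start_urls: the list of URLs the Spider starts crawling from. The first page fetched is therefore one of these; later URLs are extracted from the data returned by the initial ones.
parse(): a method of the Spider. When it is called, the Response object generated once each initial URL has been downloaded is passed to it as its only argument. It parses the response data, extracts data (producing items), and generates Request objects for the URLs that need further processing.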

In this project the start URL is named base_url rather than start_urls.

Defining the attributes

# Class attributes of the spider
name = 'search'  # spider name
allowed_domains = ['weibo.com']  # domains the spider may crawl
settings = get_project_settings()  # load the project settings
keyword_list = settings.get('KEYWORD_LIST')  # keywords to crawl
weibo_type = util.convert_weibo_type(settings.get('WEIBO_TYPE'))  # post type
contain_type = util.convert_contain_type(settings.get('CONTAIN_TYPE'))  # post content
regions = util.get_regions(settings.get('REGION'))  # regions posts were published from
base_url = 'https://s.weibo.com'
start_date = settings.get('START_DATE', datetime.now().strftime('%Y-%m-%d'))
end_date = settings.get('END_DATE', datetime.now().strftime('%Y-%m-%d'))
mongo_error = False
pymongo_error = False
mysql_error = False
pymysql_error = False

Writing the request logic

Exactly which data we request from the site is determined by the values in settings.py. The settings:

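KEYWORD_LIST = ['keyword']  # keywords to search for; several may be listed
WEIBO_TYPE = 1  # type of Weibo posts to search for
CONTAIN_TYPE = 0  # content that matching posts must contain
REGION = ['全部']  # regions to filter posts by ('全部' means all regions)
START_DATE = '2020-05-08'  # first day of the search range
END_DATE = '2020-05-24'  # last day of the search range
IMAGES_STORE = './'  # where image files are stored
FILES_STORE = './'  # where video files are stored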
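Since the spider parses START_DATE and END_DATE with the '%Y-%m-%d' format, a malformed or reversed range only surfaces once the crawl has started. Below is a minimal sketch of an up-front check. The helper validate_date_range is hypothetical and is not part of the project.

```python
from datetime import datetime

DATE_FORMAT = '%Y-%m-%d'  # the format START_DATE and END_DATE are written in


def validate_date_range(start_date, end_date):
    """Hypothetical helper: parse both dates and check their order."""
    # strptime raises ValueError when a value does not match the format.
    start = datetime.strptime(start_date, DATE_FORMAT)
    end = datetime.strptime(end_date, DATE_FORMAT)
    if start > end:
        raise ValueError('START_DATE must not be later than END_DATE')
    return start, end


start, end = validate_date_range('2020-05-08', '2020-05-24')
# Number of calendar days covered, both ends included.
print((end - start).days + 1)
```

Calling such a check before the spider starts would turn a typo in the settings into an immediate, readable error.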

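request: When building the requests, they fall into two groups: those restricted to a region and those with no region restriction. Each request ends up carrying a dictionary with the URL, the keyword, and, where applicable, the province.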

def start_requests(self):
    start_date = datetime.strptime(self.start_date, '%Y-%m-%d')
    end_date = datetime.strptime(self.end_date,
                                 '%Y-%m-%d') + timedelta(days=1)
    start_str = start_date.strftime('%Y-%m-%d') + '-0'
    end_str = end_date.strftime('%Y-%m-%d') + '-0'
    for keyword in self.keyword_list:
        if not self.settings.get('REGION') or '全部' in self.settings.get(
                'REGION'):
            # No region restriction: one request per keyword
            base_url = 'https://s.weibo.com/weibo?q=%s' % keyword
            url = base_url + self.weibo_type
            url += self.contain_type
            url += '&timescope=custom:{}:{}'.format(start_str, end_str)
            yield scrapy.Request(url=url,
                                 callback=self.parse,
                                 meta={
                                     'base_url': base_url,
                                     'keyword': keyword
                                 })
        else:
            for region in self.regions.values():
                base_url = (
                    'https://s.weibo.com/weibo?q={}&region=custom:{}:1000'
                ).format(keyword, region['code'])
                url = base_url + self.weibo_type
                url += self.contain_type
                url += '&timescope=custom:{}:{}'.format(start_str, end_str)
                # Fetch the search results for one province
                yield scrapy.Request(url=url,
                                     callback=self.parse,
                                     meta={
                                         'base_url': base_url,
                                         'keyword': keyword,
                                         'province': region
                                     })
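To make the URL layout easier to see, here is a standalone sketch of how the search URL is assembled from a keyword, a date range, and an optional region. It assumes the query parameters are named timescope and region, leaves out the weibo_type and contain_type fragments, and uses a placeholder region code; build_search_url is a hypothetical helper, not a function from the project.

```python
from datetime import datetime, timedelta


def build_search_url(keyword, start_date, end_date, region_code=None):
    """Sketch of the URL layout built in start_requests (type filters omitted)."""
    start = datetime.strptime(start_date, '%Y-%m-%d')
    # One day is added so that END_DATE itself falls inside the window.
    end = datetime.strptime(end_date, '%Y-%m-%d') + timedelta(days=1)
    url = 'https://s.weibo.com/weibo?q={}'.format(keyword)
    if region_code is not None:
        url += '&region=custom:{}:1000'.format(region_code)
    # The trailing "-0" is the hour component: midnight of each day.
    url += '&timescope=custom:{}-0:{}-0'.format(
        start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d'))
    return url


print(build_search_url('keyword', '2020-05-08', '2020-05-24'))
# '11' is only a placeholder region code for illustration.
print(build_search_url('keyword', '2020-05-08', '2020-05-24', region_code='11'))
```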

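parse()

In parse(), an empty result page produces an "empty page" message. In practice, though, we found that some keywords trigger anti-crawling measures that also show an empty page (for example "疫和谐情"). The other step is settling the time range to crawl: a result list with fewer than 50 pages is parsed directly, otherwise the range is split into single days. Dates use the year-month-day format.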

def parse(self, response):
    base_url = response.meta.get('base_url')
    keyword = response.meta.get('keyword')
    province = response.meta.get('province')
    is_empty = response.xpath(
        '//div[@class="card card-no-result s-pt20b40"]')
    page_count = len(response.xpath('//ul[@class="s-scroll"]/li'))
    if is_empty:
        print('No search results on the current page')
    elif page_count < 50:
        # Parse the current page
        for weibo in self.parse_weibo(response):
            self.check_environment()
            yield weibo
        next_url = response.xpath(
            '//a[@class="next"]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(url=next_url,
                                 callback=self.parse_page,
                                 meta={'keyword': keyword})
    else:
        start_date = datetime.strptime(self.start_date, '%Y-%m-%d')
        end_date = datetime.strptime(self.end_date, '%Y-%m-%d')
        while start_date <= end_date:
            start_str = start_date.strftime('%Y-%m-%d') + '-0'
            start_date = start_date + timedelta(days=1)
            end_str = start_date.strftime('%Y-%m-%d') + '-0'
            url = base_url + self.weibo_type
            url += self.contain_type
            url += '&timescope=custom:{}:{}&page=1'.format(
                start_str, end_str)
            # Fetch one day of search results
            yield scrapy.Request(url=url,
                                 callback=self.parse_by_day,
                                 meta={
                                     'base_url': base_url,
                                     'keyword': keyword,
                                     'province': province,
                                     'date': start_str[:-2]
                                 })
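The day-splitting loop in the final branch can be tried on its own. The sketch below reproduces only the date arithmetic as a generator; day_windows is a hypothetical name, and no requests are made.

```python
from datetime import datetime, timedelta


def day_windows(start_date, end_date):
    """Yield one (start_str, end_str) pair per day, END_DATE included."""
    current = datetime.strptime(start_date, '%Y-%m-%d')
    last = datetime.strptime(end_date, '%Y-%m-%d')
    while current <= last:
        following = current + timedelta(days=1)
        # "-0" marks hour 0, so each window runs from midnight to midnight.
        yield (current.strftime('%Y-%m-%d') + '-0',
               following.strftime('%Y-%m-%d') + '-0')
        current = following


windows = list(day_windows('2020-05-08', '2020-05-10'))
for window in windows:
    print(window)
```

Each pair corresponds to one timescope value, so a three-day range yields three separate requests.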

parse_by_day()

parse_by_day() filters results one day at a time; the next finer unit is handled by parse_by_hour().

def parse_by_day(self, response):
    """Filter results one day at a time."""
    base_url = response.meta.get('base_url')
    keyword = response.meta.get('keyword')
    province = response.meta.get('province')
    is_empty = response.xpath(
        '//div[@class="card card-no-result s-pt20b40"]')
    date = response.meta.get('date')
    page_count = len(response.xpath('//ul[@class="s-scroll"]/li'))
    if is_empty:
        print('No search results on the current page')
    elif page_count < 50:
        # Parse the current page
        for weibo in self.parse_weibo(response):
            self.check_environment()
            yield weibo
        next_url = response.xpath(
            '//a[@class="next"]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(url=next_url,
                                 callback=self.parse_page,
                                 meta={'keyword': keyword})
    else:
        start_date_str = date + '-0'
        start_date = datetime.strptime(start_date_str, '%Y-%m-%d-%H')
        for i in range(1, 25):
            # Format the hour without a leading zero
            start_str = start_date.strftime('%Y-%m-%d-X%H').replace(
                'X0', 'X').replace('X', '')
            start_date = start_date + timedelta(hours=1)
            end_str = start_date.strftime('%Y-%m-%d-X%H').replace(
                'X0', 'X').replace('X', '')
            url = base_url + self.weibo_type
            url += self.contain_type
            url += '&timescope=custom:{}:{}&page=1'.format(
                start_str, end_str)
            # Fetch one hour of search results
            yield scrapy.Request(url=url,
                                 callback=self.parse_by_hour_province
                                 if province else self.parse_by_hour,
                                 meta={
                                     'base_url': base_url,
                                     'keyword': keyword,
                                     'province': province,
                                     'start_time': start_str,
                                     'end_time': end_str
                                 })
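The strftime call followed by two replace calls exists only to print the hour without a leading zero (for example 2020-05-08-9 rather than 2020-05-08-09). As a sketch, the same strings can be produced by reading the hour attribute directly; format_hour and hour_windows are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta


def format_hour(moment):
    """Format as YYYY-MM-DD-H, with no zero padding on the hour."""
    return '{}-{}'.format(moment.strftime('%Y-%m-%d'), moment.hour)


def hour_windows(date_str):
    """Yield the 24 one-hour (start, end) windows of a single day."""
    start = datetime.strptime(date_str, '%Y-%m-%d')
    for _ in range(24):
        end = start + timedelta(hours=1)
        yield format_hour(start), format_hour(end)
        start = end


hours = list(hour_windows('2020-05-08'))
print(hours[0])   # first hour of the day
print(hours[23])  # last hour; its end falls on the next day
```

Both approaches give identical strings; reading moment.hour simply avoids the placeholder character.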

parse_by_hour():

def parse_by_hour(self, response):
    """Filter results one hour at a time."""
    keyword = response.meta.get('keyword')
    is_empty = response.xpath(
        '//div[@class="card card-no-result s-pt20b40"]')
    start_time = response.meta.get('start_time')
    end_time = response.meta.get('end_time')
    page_count = len(response.xpath('//ul[@class="s-scroll"]/li'))
    if is_empty:
        print('No search results on the current page')
    elif page_count < 50:
        # Parse the current page
        for weibo in self.parse_weibo(response):
            self.check_environment()
            yield weibo
        next_url = response.xpath(
            '//a[@class="next"]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(url=next_url,
                                 callback=self.parse_page,
                                 meta={'keyword': keyword})
    else:
        for region in self.regions.values():
            url = ('https://s.weibo.com/weibo?q={}&region=custom:{}:1000'
                   ).format(keyword, region['code'])
            url += self.weibo_type
            url += self.contain_type
            url += '&timescope=custom:{}:{}&page=1'.format(
                start_time, end_time)
            # Fetch one hour of search results for one province
            yield scrapy.Request(url=url,
                                 callback=self.parse_by_hour_province,
                                 meta={
                                     'keyword': keyword,
                                     'start_time': start_time,
                                     'end_time': end_time,
                                     'province': region
                                 })
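Taken together, parse(), parse_by_day() and parse_by_hour() follow one rule: when a result list reaches 50 pages, narrow the query and try again. The sketch below only models that decision, as read from the callbacks shown here; next_callback is a hypothetical function, and parse_by_hour_province is named but not covered because its body is not shown in this post.

```python
MAX_PAGES = 50  # mirrors the `page_count < 50` check in the callbacks


def next_callback(level, page_count, has_province=False):
    """Name the callback that would receive the follow-up requests.

    `level` is 'range', 'day' or 'hour': the granularity of the window
    that produced the current result page.
    """
    if page_count < MAX_PAGES:
        # The window is small enough: parse it and follow the "next" links.
        return 'parse_page'
    if level == 'range':
        return 'parse_by_day'
    if level == 'day':
        return 'parse_by_hour_province' if has_province else 'parse_by_hour'
    if level == 'hour':
        # An hour that is still too large is split by province.
        return 'parse_by_hour_province'
    raise ValueError('unknown level: {!r}'.format(level))


for level in ('range', 'day', 'hour'):
    print(level, '->', next_callback(level, page_count=50))
```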

parse_page

def parse_page(self, response):
    """Parse one page of search results."""
    keyword = response.meta.get('keyword')
    is_empty = response.xpath(
        '//div[@class="card card-no-result s-pt20b40"]')
    if is_empty:
        print('No search results on the current page')
    else:
        for weibo in self.parse_weibo(response):
            self.check_environment()
            yield weibo
        next_url = response.xpath(
            '//a[@class="next"]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(url=next_url,
                                 callback=self.parse_page,
                                 meta={'keyword': keyword})
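The pagination step joins base_url and the href of the "next" link by string concatenation, which works as long as the href is a root-relative path. A slightly more defensive sketch uses urljoin from the standard library; the href values below are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://s.weibo.com'


def absolute_next_url(href):
    """Resolve the href of the "next" link against the site root."""
    return urljoin(BASE_URL, href)


# Hypothetical hrefs, shaped like search-result pagination links.
print(absolute_next_url('/weibo?q=keyword&page=2'))
print(absolute_next_url('https://s.weibo.com/weibo?q=keyword&page=3'))
```

Unlike plain concatenation, urljoin leaves an href that is already absolute unchanged. Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current page.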
