Work Report on the Epidemic Public Opinion Prediction System, Part 4


    Project repository:

    https://github.com/Lee991211/Innovation_training.git

    Data cleaning

    Once a large amount of Weibo data has been collected, the redundant data needs to be cleaned so that it follows a fixed format and meets the requirements of model training (handled by @杨涛同学). Compared with the crawling itself this task is fairly simple; to keep a backup of the intermediate data, my cleaning script is split into two steps.

    wash:

    import pandas as pd

    data1 = pd.read_csv("keyword.csv")
    # data2 = pd.read_csv("Aprilplus.csv")

    # Drop everything except the two columns we need (the post text and the publish time).
    data1.drop(data1.columns[[0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]], axis=1, inplace=True)
    # data2.drop(data2.columns[[0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]], axis=1, inplace=True)

    df1 = pd.DataFrame(data1)
    # df2 = pd.DataFrame(data2)
    df1.to_csv('keywordtemp.csv', index=None)
    # df2.to_csv('washAprilPlus.csv', index=None)

    wash removes the redundant columns from the raw Weibo crawl file, keeping only two columns: the post text and the publish time.
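
    A quick way to confirm that only the two intended columns survive is to load the intermediate file and print its header. A small sketch of such a check ('微博正文' is an assumed column name here, while '发布时间' appears in wash2 below):

    import pandas as pd

    check = pd.read_csv('keywordtemp.csv')
    print(check.columns.tolist())  # expected something like ['微博正文', '发布时间']
    print(check.head())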

    wash2:

    import pandas as pd
    import csv

    data1 = pd.read_csv("keywordtemp.csv")
    # data2 = pd.read_csv("washAprilPlus.csv")

    # Keep only the date part of the publish time and rewrite it as MM/DD.
    data1['发布时间'] = data1['发布时间'].str.split(' ', expand=True)[0]
    parts = data1['发布时间'].str.split('-', expand=True)
    data1['发布时间'] = parts[1] + '/' + parts[2]

    # For every day in May, keep at most 15 rows (the cap is editable).
    # Only days that reach the 15-post cap are written out.
    result = []
    for day in range(1, 32):
        date = '05/{:02d}'.format(day)
        temp = []
        count = 1
        for row in data1.index:
            if data1.loc[row].values[1] == date:
                temp.append(data1.loc[row].values)
                count = count + 1
                if count == 16:
                    result = result + temp
                    break

    with open('keywordfinal.csv', 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for row in result:
            writer.writerow(row)

    wash2 takes the file produced by wash and keeps at most 15 posts per day (the cap can be edited in the script).
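
    For reference, the same per-day truncation can be written much more compactly with pandas' groupby; the snippet below is only a sketch of an equivalent approach, not the script used in the repository:

    import pandas as pd

    data = pd.read_csv('keywordtemp.csv')
    data['发布时间'] = data['发布时间'].str.split(' ', expand=True)[0]
    parts = data['发布时间'].str.split('-', expand=True)
    data['发布时间'] = parts[1] + '/' + parts[2]

    # Keep the May rows only, then take the first 15 rows of each day.
    may = data[data['发布时间'].str.startswith('05/')]
    may.groupby('发布时间', sort=False).head(15).to_csv('keywordfinal.csv', index=False, header=False)

    Unlike the loop above, this also keeps days that have fewer than 15 posts, so the two versions are not byte-for-byte identical.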

    观察者网 (Guancha.cn)

    Crawling the 观察者网 (Guancha.cn) homepage is not as complicated as the Weibo data: we only need to extract the url, author, title, publish_time and content fields from each page and store them in the database. The crawling part:

    def spider(self):
        # requests, lxml's etree, re and the shared headers dict are imported at module level.
        while not self.url_queue.empty():
            item = {}
            url = self.url_queue.get()
            if self.check_url(url):
                print(f'Crawling {url}')
                response = requests.get(url, headers=headers)
                response.encoding = "utf-8"
                html = etree.HTML(response.text)
                results = html.xpath('//ul/li[contains(@class,"left left-main")]')
                for result in results:
                    item['url'] = url
                    author = result.xpath('./ul/li/div[contains(@class,"author-intro")]/p/a/text()')
                    if not author:
                        # Fall back to the byline next to the publish time.
                        author = html.xpath('//div[contains(@class,"time")]/span[3]/text()')
                    if not author:
                        # Still no author: hand the page to the generic article parser.
                        self.get_news(response.text, item)
                        continue
                    item['author'] = author[0]
                    item['title'] = result.xpath('./h3/text()')[0]
                    item['publish_time'] = result.xpath('./div[contains(@class,"time")]/span[1]/text()')[0]
                    content = result.xpath('./div[contains(@class,"content")]/p/text()')
                    content = ''.join(content)
                    content = re.sub(r'\s', '', content)  # strip all whitespace inside the body text
                    item['content'] = content
                    self.save(item)

    The storage part:

    def save(self, item):
        self.cursor.execute(self.sql, [item['title'], item['author'], item['publish_time'],
                                       item['content'], item['url']])
        self.cnn.commit()
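
    save relies on the database connection (self.cnn), cursor and prepared INSERT statement (self.sql) set up in the class constructor, which is not shown here. A minimal sketch of what that setup might look like, assuming pymysql and a table named guancha_news; the library, database and table names are assumptions, not code from the repository:

    import pymysql

    # Illustrative setup only: the real connection parameters and table live in the repository.
    cnn = pymysql.connect(host='localhost', user='root', password='***',
                          database='yuqing', charset='utf8mb4')
    cursor = cnn.cursor()
    sql = ('INSERT INTO guancha_news (title, author, publish_time, content, url) '
           'VALUES (%s, %s, %s, %s, %s)')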

    To avoid crawling the same data twice, every url has to be checked against the ones already stored locally before it is fetched, which is where the check_url and add_urls methods come in:

    def check_url(self, url):
        if url in self.urls:
            print(f'{url} already exists')
            return False
        else:
            self.urls.append(url)
            return True

    def add_urls(urls, queue):
        for url in urls:
            url = 'https://www.guancha.cn' + url
            queue.put(url)
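
    To tie these together, the relative links extracted from the homepage are prefixed with the site's domain and pushed into the queue before spider starts draining it. A rough usage sketch, with made-up relative paths for illustration (assuming add_urls is usable as the plain helper function shown above):

    from queue import Queue

    url_queue = Queue()
    # Relative article paths as they appear in the homepage's href attributes (illustrative placeholders).
    relative_links = ['/internation/2022_07_11_xxxxxx.shtml', '/politics/2022_07_11_yyyyyy.shtml']
    add_urls(relative_links, url_queue)  # each path is prefixed with https://www.guancha.cn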

    Conclusion

    This project has real practical significance, especially at this moment in 2020. We have entered the era of big data, and letting the data speak has become an effective way to study social science. Training the analysis model of any system starts with acquiring data, so crawling forms the lowest layer of a data-analysis system. As the next step, I hope to automate the crawler scripts and run them in parallel.
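
    Since the spider already pulls its urls from a thread-safe queue.Queue, one natural way to parallelize it would be to run several workers against the same queue. The sketch below only illustrates that idea; crawler stands for a hypothetical instance of the spider class above, and each worker would need its own database connection in practice, since a shared cursor is not thread-safe:

    import threading

    # Several worker threads drain the same url_queue (crawler is a hypothetical spider instance).
    workers = [threading.Thread(target=crawler.spider) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()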
