Scraping Tencent Job Listings: A Case Study

    Tech 2022-07-12

    A while back I was studying Scrapy, and one of its classic examples scrapes Tencent's job listings. When I checked, the Tencent careers site had been redesigned, so I took a look myself and wrote a demo that actually works.

    Start by inspecting the Tencent careers homepage: the job data is not in the page's elements (it is loaded dynamically), but the information you need is still easy to spot in the browser's network panel.

    Looking at the requests, you can extract the URL https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10 , where pageIndex is the page number and pageSize is the number of items per page.
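As a quick sketch of the pagination math, you can compute how many pages to request from the API's total count and pre-build their URLs. (The total of 95 postings used below is an invented example; the real value comes from the response's Data.Count field, shown later.)

```python
# Build the paginated API URLs from the total posting count.
URL = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10'

def page_urls(total_count, page_size=10):
    # Number of pages, rounding up to cover a partial last page
    pages = (total_count + page_size - 1) // page_size
    return [URL.format(i) for i in range(1, pages + 1)]

urls = page_urls(95)
print(len(urls))  # 10 pages for 95 postings
print(urls[0])
```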

    Next, let's exercise the Scrapy framework.

```python
import json

import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10'
        self.page = 1

    def parse(self, response):
        data = json.loads(response.text)  # response.text is the body as a string
        li_list = data["Data"]["Posts"]   # list of postings on the current page
        count = data["Data"]["Count"]     # total number of postings
        for li in li_list:
            item = {}
            item["title"] = li["RecruitPostName"]
            item["position"] = li["CountryName"] + li["LocationName"]
            item["cate"] = li["CategoryName"]
            item["public_info"] = li["Responsibility"]
            yield item
        # Build the next page's URL until all postings are covered
        if self.page * 10 <= count:
            self.page += 1
            next_url = self.url.format(self.page)
            yield scrapy.Request(next_url, callback=self.parse)
```
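To see the field extraction in isolation, here is a sketch that runs the same parsing logic over a hand-made sample payload. The sample values are invented; only the key names match the API fields used in the spider.

```python
import json

# A minimal hand-made payload mimicking the API's JSON shape (values are invented)
sample = json.dumps({
    "Data": {
        "Count": 1,
        "Posts": [{
            "RecruitPostName": "Backend Engineer",
            "CountryName": "China",
            "LocationName": "Shenzhen",
            "CategoryName": "Technology",
            "Responsibility": "Build and maintain services.",
        }],
    }
})

data = json.loads(sample)
items = []
for li in data["Data"]["Posts"]:
    # Same field mapping as HrSpider.parse
    items.append({
        "title": li["RecruitPostName"],
        "position": li["CountryName"] + li["LocationName"],
        "cate": li["CategoryName"],
        "public_info": li["Responsibility"],
    })

print(items[0]["position"])  # ChinaShenzhen
```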

    Then add the data-handling step in pipelines.py to write the items into MongoDB.

```python
from pymongo import MongoClient

# Create the database client and pick the target collection
client = MongoClient(host="192.168.152.145", port=27017)
collection = client["tencent"]["hr"]


class MyspiderPipeline:
    def process_item(self, item, spider):
        # insert() was removed in PyMongo 4; use insert_one() instead
        collection.insert_one(dict(item))
        print(item)
        return item
```
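For the pipeline to run at all, it also has to be registered in the project's settings.py. A minimal sketch, assuming the Scrapy project is named myspider (adjust the dotted path to your own project):

```python
# settings.py -- enable the pipeline; the number is its priority (lower runs first)
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}
```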

    Finally, you can view the stored records in the database. Simple, isn't it?
