Scraping Tencent Job Listings: A Case Study

    Tech 2022-07-12

    A while back I was studying Scrapy, and one of its classic examples scrapes Tencent's job listings. When I checked, the Tencent careers site had been redesigned, so I took a look myself and wrote a demo that actually works.

    Start by inspecting the Tencent careers homepage: the job data is not in the page's elements (it is loaded dynamically), but the information you need is still easy to spot in the browser's network panel.

    Looking at the requests, you can extract the URL https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10 , where pageIndex is the page number and pageSize is the number of items per page.
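As a quick sketch of the pagination math, you can compute how many pages to request from the API's total count and pre-build their URLs. (The total of 95 postings used below is an invented example; the real value comes from the response's Data.Count field, shown later.)

```python
# Build the paginated API URLs from the total posting count.
URL = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10'

def page_urls(total_count, page_size=10):
    # Number of pages, rounding up to cover a partial last page
    pages = (total_count + page_size - 1) // page_size
    return [URL.format(i) for i in range(1, pages + 1)]

urls = page_urls(95)
print(len(urls))  # 10 pages for 95 postings
print(urls[0])
```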

    Next, let's exercise the Scrapy framework.

```python
import json

import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10'
        self.page = 1

    def parse(self, response):
        data = json.loads(response.text)  # response.text is the body as a string
        li_list = data["Data"]["Posts"]   # list of postings on the current page
        count = data["Data"]["Count"]     # total number of postings
        for li in li_list:
            item = {}
            item["title"] = li["RecruitPostName"]
            item["position"] = li["CountryName"] + li["LocationName"]
            item["cate"] = li["CategoryName"]
            item["public_info"] = li["Responsibility"]
            yield item
        # Build the next page's URL until all postings are covered
        if self.page * 10 <= count:
            self.page += 1
            next_url = self.url.format(self.page)
            yield scrapy.Request(next_url, callback=self.parse)
```
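To see the field extraction in isolation, here is a sketch that runs the same parsing logic over a hand-made sample payload. The sample values are invented; only the key names match the API fields used in the spider.

```python
import json

# A minimal hand-made payload mimicking the API's JSON shape (values are invented)
sample = json.dumps({
    "Data": {
        "Count": 1,
        "Posts": [{
            "RecruitPostName": "Backend Engineer",
            "CountryName": "China",
            "LocationName": "Shenzhen",
            "CategoryName": "Technology",
            "Responsibility": "Build and maintain services.",
        }],
    }
})

data = json.loads(sample)
items = []
for li in data["Data"]["Posts"]:
    # Same field mapping as HrSpider.parse
    items.append({
        "title": li["RecruitPostName"],
        "position": li["CountryName"] + li["LocationName"],
        "cate": li["CategoryName"],
        "public_info": li["Responsibility"],
    })

print(items[0]["position"])  # ChinaShenzhen
```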

    Then add the data-handling step in pipelines.py to write the items into MongoDB.

```python
from pymongo import MongoClient

# Create the database client and pick the target collection
client = MongoClient(host="192.168.152.145", port=27017)
collection = client["tencent"]["hr"]


class MyspiderPipeline:
    def process_item(self, item, spider):
        # insert() was removed in PyMongo 4; use insert_one() instead
        collection.insert_one(dict(item))
        print(item)
        return item
```
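For the pipeline to run at all, it also has to be registered in the project's settings.py. A minimal sketch, assuming the Scrapy project is named myspider (adjust the dotted path to your own project):

```python
# settings.py -- enable the pipeline; the number is its priority (lower runs first)
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}
```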

    Finally, you can view the stored records in the database. Simple, isn't it?
