requests.get(url, headers=headers) sends a GET request to the server.
requests.post(url, headers=headers) sends a POST request to the server.
Example code for GET:
```python
import requests

url = "https://down.freembook.com/mixc/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
}
demo = requests.get(url, headers=headers)
print(demo.status_code)
print(demo.text)
print(demo.content)
```
Example code for POST:
```python
import requests

url = "https://down.freembook.com/mixc/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
}
params = {
    "_token": "P1o8Fz9ZOAuBojBsNGNfPa9vivr5PqRBUFwstL8I",
    "mobile": "15263819410",
    "password": "15263819410",
    "remember": "1"
}
demo = requests.post(url, data=params, headers=headers)
```
The headers can carry the browser identification, cookies, and other information; pretending to be a browser is the simplest way to get past basic anti-scraping checks.
url is the address you want to scrape.
demo.text gives the HTML source as text, while demo.content gives the raw bytes, which is what you usually use when saving files.
POST takes one extra argument, data, which carries the parameters sent to the server; anyone who has built a website will find this easy to understand.
requests can also manage sessions and so on, which is one way to handle logins.
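As a rough sketch of the session idea (the login URL and form field names below are placeholders, not taken from any real site), a requests.Session keeps cookies between requests, so a login followed by a request to a protected page looks roughly like this:

```python
import requests

session = requests.Session()
# Placeholder URL and form fields -- adjust to the actual site's login form.
login_data = {"username": "user", "password": "pass"}
session.post("https://example.com/login", data=login_data)  # cookies returned here are stored on the session
resp = session.get("https://example.com/protected-page")    # later requests reuse those cookies automatically
print(resp.status_code)
```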
There are plenty of libraries for picking data out of a page. When I started out I used BeautifulSoup and re, which I now find awkward to learn; in my opinion the easiest to understand and most convenient is the lxml library.
Installation:
```
pip install lxml
```
lxml uses XPath syntax.
Example code:
```python
import requests
from lxml import etree

# url and headers as defined in the earlier examples
demo = requests.get(url, headers=headers)
demo_address = etree.HTML(demo.text).xpath('//a/@href')  # get the href of every <a> tag
```
Once you know these two libraries you can write a simple scraper, but you do need to understand the DOM tree and Python's data types.
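To make the DOM-tree point concrete, here is a small self-contained sketch (the HTML fragment is invented for illustration) showing that etree.HTML builds a tree and that xpath() returns an ordinary Python list:

```python
from lxml import etree

# An invented HTML fragment, just to show how XPath walks the DOM tree.
html = """
<div class="book">
  <a href="/b/1">Book One</a>
  <a href="/b/2">Book Two</a>
</div>
"""
tree = etree.HTML(html)                               # parse the text into a DOM tree
links = tree.xpath('//a/@href')                       # ['/b/1', '/b/2'] -- attribute values
titles = tree.xpath('//div[@class="book"]/a/text()')  # ['Book One', 'Book Two'] -- text nodes
print(links, titles)
```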
Data is mostly saved as binary files; a more advanced approach is to store it in a database.
Example code:
```python
demo4 = requests.get(new_url_3, headers=headers)
f = open('D:\\' + str(j) + '.mobi', 'wb')
f.write(demo4.content)  # write the downloaded data to a file
f.close()
```
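For the database option mentioned above, a rough sketch using the standard-library sqlite3 (the table name and columns are my own choice, not from the original code) could look like this:

```python
import sqlite3

# Minimal sketch: stores each downloaded file as a BLOB next to its URL.
conn = sqlite3.connect("books.db")
conn.execute("CREATE TABLE IF NOT EXISTS books (id INTEGER PRIMARY KEY, url TEXT, data BLOB)")
# new_url_3 and demo4.content come from the download snippet above
conn.execute("INSERT INTO books (url, data) VALUES (?, ?)", (new_url_3, demo4.content))
conn.commit()
conn.close()
```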
Single-page scrape and save:

```python
import requests
from lxml import etree


def getspider(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0'
    }
    response = requests.get(url, headers=headers)
    listpassages = []
    if response.status_code == 200:
        print("Page fetched successfully")
        response.encoding = "utf-8"  # check how the page is encoded before setting this
        response = response.text
        element = etree.HTML(response)
        passages = element.xpath(".//dd[@id='contents']//text()")
        for passage in passages:
            listpassages.append(passage)
    return listpassages


def save_text(texts):
    f = open("F:\\Python Document\\story_1.txt", "w", encoding="utf-8")
    for text in texts:
        f.write(text)
    f.close()


def main():
    url = "https://www.23us.so/files/article/html/32/32983/15413313.html"
    listpassages = getspider(url)
    save_text(listpassages)
    # print(len(listpassages))
    # for passage in listpassages:
    #     print(passage)


main()
```
This scrapes a single page of a novel site; I wrote it a long time ago, when I was just getting the hang of scraping.
Using a for loop to scrape data in bulk:
```python
import requests
from lxml import etree

url = "https://down.freembook.com/mixc/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
}
demo = requests.get(url, headers=headers)
demo_address = etree.HTML(demo.text).xpath('//a/@href')
j = 0
print("Starting")
for mydemo_address in demo_address[1:]:  # 58 done
    new_url_1 = url + mydemo_address
    demo2 = requests.get(new_url_1)
    demo2_address = etree.HTML(demo2.text).xpath('//a/@href')
    new_url_2 = new_url_1 + demo2_address[1]
    demo3 = requests.get(new_url_2, headers=headers)
    demo3_address = etree.HTML(demo3.text).xpath('//a/@href')
    new_url_3 = new_url_2 + demo3_address[3]
    demo4 = requests.get(new_url_3, headers=headers)
    f = open('D:\\' + str(j) + '.mobi', 'wb')
    f.write(demo4.content)  # write the downloaded data to a file
    f.close()
    print("Book " + str(j) + " downloaded")
    j = j + 1
```
I wrote this a few days before this post, when the scraping mood struck again, and took the chance to tidy up my scraping notes. Right now some free e-book sites are in the spotlight and many have shut down. I was hunting everywhere for Kindle e-books, found this site, and dusted off my scraper to grab a batch; the site owner is a really kind person!
I moved this code onto a server to run.
It downloaded nine hundred-odd books before stopping; it is quite slow, so multithreading is really needed. Looking back now, there is no need to learn the threading library separately; just use the scraping framework Scrapy, which bundles many of these libraries together.
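As a sketch of the multithreading idea without a full framework (download_book and book_urls are placeholders for the downloading step and the collected links), the standard-library thread pool in concurrent.futures could be used like this:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
}


def download_book(args):
    # Hypothetical helper: args is an (index, url) pair for one final download link.
    j, book_url = args
    resp = requests.get(book_url, headers=headers)
    with open('D:\\' + str(j) + '.mobi', 'wb') as f:
        f.write(resp.content)
    print("Book " + str(j) + " downloaded")


# book_urls would be the list of final download links collected by the crawler above.
book_urls = []
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(download_book, enumerate(book_urls))
```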
