Parsing web pages and scraping data with BeautifulSoup and pandas
Target site: http://s.askci.com/data/economy/00002/1/
Method 1: parse the page with BeautifulSoup
import requests
import urllib3
from bs4 import BeautifulSoup
from urllib.parse import urlencode

urllib3.disable_warnings()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}

for i in range(1, 3):
    paras = {
        'reportTime': '2020-06-30',   # must be a string; a bare 2020-6-30 is integer arithmetic
        'pageNum': i
    }
    # 1. Build the URL
    url = 'https://s.askci.com/stock/a/0-0?' + urlencode(paras)
    # 2. Request the resource
    response = requests.get(url, headers=headers, verify=False)
    response.encoding = response.apparent_encoding
    html = response.text
    # 3. Parse the page: regex, XPath, or bs4
    # Method 1: BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    # print(soup)
    tr_list = soup.find_all('tbody')
    for data in tr_list:
        print(data.text.split())
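Note that data.text.split() flattens every cell of the table body into one long list, so row boundaries are lost. Below is a minimal sketch of a row-by-row alternative, assuming the usual <tr>/<td> layout of the listing table; the URL simply reuses the parameters from the block above.

# Sketch: walk <tr>/<td> instead of splitting the whole tbody text,
# so each record stays together as one list of cells.
import requests
import urllib3
from bs4 import BeautifulSoup

urllib3.disable_warnings()
url = 'https://s.askci.com/stock/a/0-0?reportTime=2020-06-30&pageNum=1'
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'lxml')

rows = []
for tr in soup.select('tbody tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:                       # skip empty or decorative rows
        rows.append(cells)
print(rows[:3])                     # first three records, one list per row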
Method 2: parse the page with pandas
import pandas as pd
import requests
import urllib3

urllib3.disable_warnings()

url = "https://s.askci.com/data/economy/00002/1/"
# Skip certificate verification; disable_warnings() silences the resulting InsecureRequestWarning
response = requests.get(url, verify=False)
response.encoding = response.apparent_encoding
html = response.text
# Method 2: let pandas pull the table straight out of the HTML
tb = pd.read_html(html, header=0)[0]
print(tb)
tb.to_csv("economic_indicators.csv")
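Method 2 fetches only a single page. If the target table is paginated like the stock listing in Method 1, the same requests + pd.read_html combination can be run per page and the results concatenated. The sketch below works under that assumption; the reportTime/pageNum parameters are taken from the first block, not from the economy URL.

# Sketch: combine the page loop from Method 1 with pandas parsing.
import pandas as pd
import requests
import urllib3
from urllib.parse import urlencode

urllib3.disable_warnings()

frames = []
for i in range(1, 3):
    paras = {'reportTime': '2020-06-30', 'pageNum': i}
    url = 'https://s.askci.com/stock/a/0-0?' + urlencode(paras)
    response = requests.get(url, verify=False)
    response.encoding = response.apparent_encoding
    frames.append(pd.read_html(response.text, header=0)[0])

all_pages = pd.concat(frames, ignore_index=True)   # one DataFrame for all pages
all_pages.to_csv("stock_list.csv")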
Method 3: parse the page with pandas again, this time reading the URL directly
import pandas as pd

url = "https://s.askci.com/data/economy/00002/1/"
# Method 3: pd.read_html fetches the URL itself, so no explicit request is needed
# (the urllib3.disable_warnings() call from the previous block has no effect here)
tb = pd.read_html(url, header=0)[0]
tb.to_csv("economic_indicators.csv")
The last approach is the most concise and also the easiest to understand.
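One small note on saving: to_csv also writes the DataFrame index as an extra first column, and its default UTF-8 output has no BOM, so Excel on Windows may garble the Chinese column headers. A convenience sketch using standard pandas options (these options are not part of the original snippets):

# Optional: drop the index column and add a BOM so Excel displays the
# Chinese headers correctly; plain UTF-8 is fine for most other tools.
import pandas as pd

tb = pd.read_html("https://s.askci.com/data/economy/00002/1/", header=0)[0]
tb.to_csv("economic_indicators.csv", index=False, encoding="utf-8-sig")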