Web Scraping: Extracting Table Data from Web Pages (Part 4)

    Tech · 2022-07-10

    Parsing web pages and scraping table data with BeautifulSoup and pandas.

    Target site: http://s.askci.com/data/economy/00002/1/

    Method 1: parse the page with BeautifulSoup

import requests
import urllib3
urllib3.disable_warnings()
from bs4 import BeautifulSoup
from urllib.parse import urlencode

for i in range(1, 3):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
    }
    paras = {
        'reportTime': '2020-6-30',  # must be a string; the bare 2020-6-30 is arithmetic (1984)
        'pageNum': i
    }
    # 1. Build the URL
    url = 'https://s.askci.com/stock/a/0-0?' + urlencode(paras)
    # 2. Fetch the page (pass headers so the User-Agent is actually sent)
    response = requests.get(url, headers=headers, verify=False)
    response.encoding = response.apparent_encoding
    html = response.text
    # 3. Parse the page (options: regex, XPath, bs4); here we use BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    tr_list = soup.find_all('tbody')
    for data in tr_list:
        print(data.text.split())

    Method 2: parse the page with pandas

import pandas as pd
import requests
import urllib3
urllib3.disable_warnings()

url = "https://s.askci.com/data/economy/00002/1/"
response = requests.get(url, verify=False)
response.encoding = response.apparent_encoding
html = response.text
# read_html returns a list of DataFrames, one per <table>; take the first
tb = pd.read_html(html, header=0)[0]
print(tb)
tb.to_csv("economic_indicators.csv")

    Method 3: pandas again, passing the URL directly

import pandas as pd
import urllib3
urllib3.disable_warnings()

url = "https://s.askci.com/data/economy/00002/1/"
# read_html can fetch and parse the URL itself in one step
tb = pd.read_html(url, header=0)[0]
tb.to_csv("economic_indicators.csv")
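    As a small aside on the URL-building step in Method 1, urlencode from the standard library turns the parameter dict into a query string. A minimal sketch using the same parameters as above (no network access needed):

import requests  # only to mirror the article's imports; not used below
from urllib.parse import urlencode

# Same parameters as Method 1, first page
paras = {'reportTime': '2020-6-30', 'pageNum': 1}
query = urlencode(paras)          # key=value pairs joined with '&'
url = 'https://s.askci.com/stock/a/0-0?' + query
print(url)  # https://s.askci.com/stock/a/0-0?reportTime=2020-6-30&pageNum=1

    Since Python 3.7, dicts preserve insertion order, so the parameters appear in the query string in the order they were written.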

    The last approach is the most concise and the easiest to understand.
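    To see why the examples above index the result with [0]: pd.read_html parses every <table> in the markup and returns a list of DataFrames. A minimal offline sketch with made-up sample indicators (not the real askci.com data); StringIO is used because newer pandas versions expect a file-like object rather than a literal HTML string:

from io import StringIO
import pandas as pd

# Made-up sample table, standing in for the scraped page
html = """
<table>
  <tr><th>指标</th><th>数值</th></tr>
  <tr><td>GDP</td><td>456614</td></tr>
  <tr><td>CPI</td><td>102.5</td></tr>
</table>
"""

# read_html returns a list: one DataFrame per <table> found
tables = pd.read_html(StringIO(html), header=0)
tb = tables[0]
print(tb)
# When saving, index=False drops pandas' row-index column from the CSV:
# tb.to_csv("economic_indicators.csv", index=False)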
