Web Scraping: Data Parsing

Tech 2022-07-20

Data parsing

Regular expressions (string matching)

The re module (patterns and methods)

1) Syntax rules

https://blog.csdn.net/CareChere/article/details/52315728?

2) Testing a pattern

https://regex101.com/#javascript
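As a rough illustration of the re module on scraped markup, the sketch below pulls link addresses and link text out of a small HTML string. The HTML snippet, the pattern, and the variable names are made up for this example and are not from the original notes.

```python
import re

# Hypothetical HTML snippet, used only for illustration.
html = '''
<ul>
  <li><a href="https://example.com/a">First</a></li>
  <li><a href="https://example.com/b">Second</a></li>
</ul>
'''

# Compile once; the two groups capture the href value and the link text.
# .*? is non-greedy, so each match stops at the nearest closing quote or tag.
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

# findall returns a list of tuples, one tuple per match.
links = link_pattern.findall(html)

# search returns the first match object, or None when nothing matches.
first = link_pattern.search(html)
first_url = first.group(1) if first else None

print(links)
print(first_url)
```

A pattern like this only holds up while the markup stays exactly in this shape, which is one reason the notes go on to xpath and bs4.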

xpath

1) Installation

# Install lxml, a parsing library that supports HTML and XML
pip install lxml
from lxml import etree

2) Usage

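1. Convert the data into a parseable type
   xpath_data = etree.HTML(data)
2. Call the xpath method
   result = xpath_data.xpath('/html/head/title//text()')
3. xpath syntax
   1. Child node: /
   2. Across nodes (any depth): //
   3. Exact tag: //a[@attribute="value"], where the attribute is e.g. mon or class
   4. Content wrapped by a tag: text(); link address: @href
   5. xpath returns a list
   6. xpath indices start at 1
   7. In '//a[2]', the [] only selects among tags at the same level (siblings); it cannot cross nodes
   8. Writing the path
      1. Write it by hand
      2. Use the browser's right-click "copy XPath"; the result usually needs editing
   9. Fuzzy matching: //div[contains(@class,"a")]
   10. Get the next node (same level)
4. Save the data as JSON
   # Convert list to str
   data_str = json.dumps(self.data_list)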
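The steps above can be sketched end to end as follows. The HTML string and the variable names are invented for illustration. The notes do not give an expression for the "next node" item, so the `following-sibling` axis used here is one standard XPath way to do it, not necessarily what the author had in mind.

```python
import json
from lxml import etree

# Hypothetical page content, used only for illustration.
data = '''
<html>
  <head><title>Demo page</title></head>
  <body>
    <div class="nav main">
      <a class="item" href="/first">First</a>
      <a class="item" href="/second">Second</a>
    </div>
  </body>
</html>
'''

# 1. Convert the string into an element tree.
xpath_data = etree.HTML(data)

# 2. Every xpath() call below returns a list.
title = xpath_data.xpath('/html/head/title//text()')
hrefs = xpath_data.xpath('//a[@class="item"]/@href')

# Indices start at 1 and count among siblings under the same parent.
second = xpath_data.xpath('//a[2]/text()')

# Fuzzy match: the class attribute only has to contain "nav".
fuzzy = xpath_data.xpath('//div[contains(@class, "nav")]/a/text()')

# Assumption: next sibling at the same level via the following-sibling axis.
next_node = xpath_data.xpath('//a[1]/following-sibling::*[1]/text()')

# 4. Serialize the extracted list to a JSON string.
data_str = json.dumps(hrefs)
print(title, hrefs, second, fuzzy, next_node, data_str)
```

Because the result is always a list, an empty list (not an exception) is what comes back when a path matches nothing, so it is worth checking the length before indexing.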

bs4

1) Installation

pip install beautifulsoup4
from bs4 import BeautifulSoup

2) Usage

Four main object types: BeautifulSoup, Tag, NavigableString, Comment

1. Convert the type
   soup = BeautifulSoup(html_doc, 'lxml')
2. Parse the data
   # find: returns the first tag object that matches the query
   result = soup.find(name='p', attrs={"class": "story"})
   # find_all: returns a list of tag objects
   result = soup.find_all(name='p', attrs={"class": "story"})
   # select_one: CSS selector, returns the first match
   result = soup.select_one('.sister')
   # select: CSS selector, returns a list
   # Class selector
   result = soup.select('.sister')
   # id selector
   result = soup.select('#one')
   # Descendant selector
   result = soup.select('head title')
   # Group selector
   result = soup.select('title,.title')
   # Attribute selector
   result = soup.select('a[id="link3"]')
   # Content wrapped by the tag (select returns a list, so index it first)
   result = soup.select('.title')[0].get_text()
   # Tag attribute
   result = soup.select('#link1')[0].get('href')
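A runnable sketch of the calls above is shown here. The notes do not include the contents of html_doc, so the document below is a stand-in written for this example. It also uses Python's built-in 'html.parser' instead of the 'lxml' parser from the notes so that it runs without lxml installed; passing 'lxml' should work the same way when that library is available.

```python
from bs4 import BeautifulSoup

# Stand-in document; the original notes do not show their html_doc.
html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
'''

# Assumption: 'html.parser' is swapped in for 'lxml' to avoid an extra dependency.
soup = BeautifulSoup(html_doc, 'html.parser')

# find returns one Tag (or None); find_all returns a list-like of Tags.
first_story = soup.find(name='p', attrs={"class": "story"})
all_sisters = soup.find_all(name='a', attrs={"class": "sister"})

# select_one returns the first CSS match; select returns a list.
link3 = soup.select_one('a[id="link3"]')
group = soup.select('title,.title')

# Text inside a tag and the value of an attribute.
title_text = soup.select('.title')[0].get_text()
link1_href = soup.select('#link1')[0].get('href')

print(first_story.name, len(all_sisters), link3.get_text(), len(group))
print(title_text, link1_href)
```

Note that find and select_one return None when nothing matches, so real scraping code usually checks the result before calling get_text() or get().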

Comparison

Difficulty: regular expressions > xpath > bs4

Efficiency: regular expressions > xpath >= bs4

html & xml(json)

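html: front end, meant to be displayed to users

xml: a data format for exchange between front-end or mobile clients and the back end

common: both can be parsed with xpath and bs4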
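To illustrate the point that xml can be queried with the same xpath calls as html, here is a minimal sketch using lxml. The XML payload and the names in it are invented for this example.

```python
from lxml import etree

# Hypothetical XML payload, such as a back end might return.
# Bytes are used because the document carries an encoding declaration.
xml_data = b'''<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book id="1"><name>Alpha</name></book>
  <book id="2"><name>Beta</name></book>
</books>
'''

# etree.XML parses strict XML, whereas etree.HTML tolerates broken HTML.
root = etree.XML(xml_data)

# The same xpath syntax applies: text(), @attribute, 1-based indices.
names = root.xpath('//book/name/text()')
second_id = root.xpath('//book[2]/@id')

print(names, second_id)
```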
