Getting Started with the Beautiful Soup Library
Study Notes and Unit Summary
Installing Beautiful Soup
https://www.crummy.com/software/BeautifulSoup/
On Windows: open cmd with "Run as administrator" and run pip install beautifulsoup4
Quick Installation Check
First, fetch the source of demo.html with the Requests library:
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
Next, import the BeautifulSoup class and parse the page:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
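The check above depends on network access to the demo page. As a minimal offline sketch (the inline fragment is only an illustration), the installation can also be confirmed by printing the package version and parsing a small string:

```python
import bs4
from bs4 import BeautifulSoup

# Report the installed version of the bs4 package.
version = bs4.__version__
print("bs4 version:", version)

# Parse a tiny inline fragment; no network access is needed.
soup = BeautifulSoup("<p>data</p>", "html.parser")
text = soup.p.string
print(text)
```

If both lines print without an ImportError, the library is installed and usable.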
Basic Elements of Beautiful Soup
Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".
Importing Beautiful Soup
from bs4 import BeautifulSoup
import bs4
The BeautifulSoup Class
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<html>data</html>", "html.parser")
>>> soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
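The open(...) call above never closes its file handle. A minimal sketch of the same idea with a with block, assuming a temporary file as a hypothetical stand-in for D://demo.html:

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a small HTML file to a temporary location
# (a hypothetical stand-in for the demo.html file on disk).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><body><p>data</p></body></html>")
    path = tmp.name

try:
    # BeautifulSoup accepts an open file handle as well as a string;
    # the with block closes the handle once parsing is done.
    with open(path, encoding="utf-8") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser")
finally:
    os.remove(path)

print(soup_from_file.p.string)
```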
Tag
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
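In the session above, soup.a yields only the first <a> tag in the document. A minimal sketch of collecting every matching tag with find_all(), assuming a shortened inline snippet in place of the demo page:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (an assumption for illustration).
html = (
    '<p><a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # attribute access returns the first match only
all_links = soup.find_all("a")   # find_all returns every matching tag

ids = [a["id"] for a in all_links]
print(first["id"], ids)
```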
Tag name
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
Tag attrs (Attributes)
A tag can have zero or more attributes, stored as a dict.
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
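Because attrs is a plain dict, looking up a missing key raises KeyError. A minimal sketch of the shorthand accessors on a Tag, using the first link from the demo page as an inline string:

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://www.icourse163.org/course/BIT-268001" '
    'class="py1" id="link1">Basic Python</a>'
)
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]           # shorthand for tag.attrs["href"]
classes = tag.get("class")   # class is multi-valued, so the value is a list
missing = tag.get("title")   # absent attribute: None instead of KeyError
has_id = tag.has_attr("id")  # membership test without fetching the value

print(href, classes, missing, has_id)
```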
Tag NavigableString
A NavigableString can be reached across several levels of nesting: soup.p.string below returns the text inside the nested <b> tag.
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
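The .string shortcut only works when a tag has a single child, directly or through nesting. A minimal sketch, assuming two short inline paragraphs, of what happens otherwise and how get_text() covers that case:

```python
from bs4 import BeautifulSoup

# One child chain: <p> -> <b> -> text, so .string reaches the text.
single = BeautifulSoup("<p><b>only child</b></p>", "html.parser").p

# Several children (text and tags mixed), so .string is ambiguous.
mixed = BeautifulSoup(
    "<p>Python courses: <a>Basic Python</a> and <a>Advanced Python</a>.</p>",
    "html.parser",
).p

single_text = single.string   # the nested text
mixed_string = mixed.string   # None, because <p> has more than one child
mixed_text = mixed.get_text() # concatenates all text beneath the tag

print(single_text, mixed_string, mixed_text)
```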
Tag Comment
Comment is a special type of string: .string returns its text without the <!-- --> markers, so only the type tells it apart from ordinary text.
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
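Since a comment and ordinary text look identical through .string, code that extracts text usually has to check the type. A minimal sketch of that check with isinstance():

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

kinds = []
for tag in (newsoup.b, newsoup.p):
    s = tag.string
    # Comment is a subclass of NavigableString, so test for Comment first.
    kinds.append("comment" if isinstance(s, Comment) else "text")

print(kinds)
```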
Traversing HTML Content with bs4
Downward Traversal of the Tag Tree
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
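Besides the .contents list, a tag also exposes the .children and .descendants iterators. A minimal sketch, assuming a small inline document without whitespace between tags so the results stay short:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<body><p><b>bold</b></p><p>plain</p></body>", "html.parser"
)

# .children iterates over direct children only.
child_names = [child.name for child in soup.body.children]

# .descendants walks the whole subtree in document order;
# keep only Tag nodes and skip the text nodes.
descendant_tags = [
    node.name for node in soup.body.descendants if isinstance(node, Tag)
]

print(child_names, descendant_tags)
```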
Upward Traversal of the Tag Tree
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
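The last call prints nothing because the soup object itself has no parent (soup.parent is None). A minimal sketch of walking all ancestors with the .parents iterator, assuming a small inline document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a>link</a></p></body></html>", "html.parser"
)

# .parents yields each ancestor in turn, ending at the BeautifulSoup
# object itself, whose name is the special value '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]
root_parent = soup.parent

print(ancestor_names, root_parent)
```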
Sibling (Parallel) Traversal of the Tag Tree
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
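As the session shows, siblings include text nodes such as ' and ', not only tags. A minimal sketch, assuming a shortened inline paragraph, that iterates over siblings and filters for tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<p>intro <a>first</a> and <a>second</a>.</p>", "html.parser"
)
first = soup.a

# .next_siblings yields every following node under the same parent,
# text nodes included.
following = [str(node) for node in first.next_siblings]

# Keep only the siblings that are tags.
tag_siblings = [node for node in first.next_siblings if isinstance(node, Tag)]

# .previous_siblings walks backwards from the current node.
preceding = [str(node) for node in first.previous_siblings]

print(following, preceding)
```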
Formatted HTML Output with bs4
The prettify() Method
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
Encoding in bs4
bs4 converts any HTML input to Unicode internally and writes it out as UTF-8 by default.
>>> soup = BeautifulSoup("<p>中文</p>", "html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>
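The example above starts from a Python string, so no decoding takes place. A minimal sketch of the byte-level case, assuming GBK-encoded input (chosen here only for illustration) and the from_encoding argument to state the encoding explicitly:

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding (GBK is an assumption for this example).
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells the parser how to decode the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string              # text is held as a Unicode str
utf8_bytes = soup.p.encode("utf-8")  # serialise the tag back to UTF-8 bytes

print(text)
print(utf8_bytes)
```

Whatever the input encoding, the parsed tree holds ordinary Python strings, and encode() is the point where an output encoding is chosen.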