Python Web Crawlers: Information Organization and Extraction Methods, Study Notes and Hands-On Code

Technology 2022-07-14

Information Organization and Extraction Methods


Study Notes and Unit Summary

Three Forms of Information Markup

XML: built from tags (Tag), where each tag has a name (Name) and may carry attributes (Attribute).
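As a rough illustration of how tag names, attributes, and nested tags fit together, the sketch below parses a small XML fragment with Python's standard xml.etree.ElementTree module. The element names (course, title) and the id attribute are invented for this example and do not come from the notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: "course" is the tag name, id="268001" is an attribute.
xml_text = '<course id="268001"><title>Basic Python</title></course>'

root = ET.fromstring(xml_text)

tag_name = root.tag                    # name of the outermost tag
attributes = root.attrib               # attributes of that tag, as a dict
title_text = root.find("title").text   # text inside the nested <title> tag

print(tag_name, attributes, title_text)
```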

JSON: built from typed key-value pairs (key and value). A value can be a single item, a list, or a nested set of pairs:

"key": "value"
"key": ["value1", "value2"]
"key": {"subkey": "subvalue"}

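The three JSON patterns can be read back in Python with the standard json module. This is only a sketch: the keys and values (name, tags, provider, school) are made up for the example.

```python
import json

# Hypothetical JSON document combining the three patterns:
# a single value, a list of values, and a nested object.
json_text = """
{
    "name": "Basic Python",
    "tags": ["python", "beginner"],
    "provider": {"school": "BIT"}
}
"""

data = json.loads(json_text)

print(data["name"])                # single value
print(data["tags"][1])             # second item of the list value
print(data["provider"]["school"])  # value of the nested key
```
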
YAML: built from untyped key-value pairs, written as key: value.

key: value

key:  # Comment
  - value1
  - value2

key:
  subkey: subvalue

name: Python张良
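For comparison, the same kinds of YAML structure can be loaded in Python. The sketch assumes the third-party PyYAML package is installed (it is not part of the standard library and is not mentioned in the notes), and the keys and values are made up for the example.

```python
import yaml  # third-party PyYAML package, assumed to be installed

# Hypothetical YAML document: a plain pair, a list, and a nested mapping.
yaml_text = """
name: Basic Python      # plain key-value pair
tags:                   # list of values
  - python
  - beginner
provider:               # nested mapping
  school: BIT
"""

data = yaml.safe_load(yaml_text)

print(data["name"])
print(data["tags"])
print(data["provider"]["school"])
```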

HTML Content Search Methods Based on the bs4 Library

Approach: 1) find all <a> tags; 2) parse each tag and extract the link stored in its href attribute.

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

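The two-step approach can also be wrapped in a small reusable function. The sketch below is one possible way to do it: extract_links is a hypothetical helper name, and DEMO_HTML is a shortened stand-in for the demo page (with one extra anchor that has no href) so that the example runs without network access.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page; the last <a> deliberately has no href.
DEMO_HTML = (
    '<html><body><p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p><a name="top">no href here</a></body></html>'
)


def extract_links(html):
    """Return the href value of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a"):   # step 1: find all <a> tags
        href = tag.get("href")       # step 2: read the href attribute
        if href is not None:         # skip anchors without an href
            links.append(href)
    return links


links = extract_links(DEMO_HTML)
print(links)
```

Using tag.get("href") rather than tag["href"] means an anchor without an href yields None instead of raising KeyError.
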
<>.find_all(name, attrs, recursive, string, **kwargs) returns a list that stores the search results.

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b

name: a search string for tag names.

attrs: a search string for tag attribute values; a specific attribute can also be named as a keyword argument.

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

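Attribute filters can be written in a few equivalent-looking ways. The sketch below compares three of them on an offline copy of the demo markup: DEMO_HTML is a shortened stand-in for the page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page so the example runs offline.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# class_ has a trailing underscore because "class" is a Python keyword.
by_class = soup.find_all("a", class_="py1")
# An explicit dictionary works for any attribute name.
by_attrs = soup.find_all(attrs={"id": "link2"})
# A string passed as the second positional argument is matched against the class.
by_position = soup.find_all("p", "course")

print([tag["id"] for tag in by_class])
print([tag.string for tag in by_attrs])
print(len(by_position))
```
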
recursive: whether to search all descendants; defaults to True.
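The effect of recursive is easiest to see by searching from <body>. In the sketch below, DEMO_HTML is a shortened offline stand-in for the demo page, in which the <a> tags sit inside a <p> rather than directly under <body>.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page so the example runs offline.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")
body = soup.body

all_links = body.find_all("a")                            # default: search every descendant
direct_links = body.find_all("a", recursive=False)        # only direct children of <body>
direct_paragraphs = body.find_all("p", recursive=False)   # both <p> tags are direct children

print(len(all_links), len(direct_links), len(direct_paragraphs))
```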

string: a search string for the text region between <>…</>.

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']

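The regular-expression search on string is case-sensitive, which is why 'Basic Python' and 'Advanced Python' do not appear in the last result. A minimal sketch of a case-insensitive variant follows, again on a shortened offline stand-in for the demo page (DEMO_HTML); it also shows that each match is a text node whose enclosing tag is available through .parent.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the demo page so the example runs offline.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# Case-sensitive: matches only text containing lowercase "python".
lower_only = soup.find_all(string=re.compile("python"))
# Case-insensitive: also matches "Python".
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

# Each result is a text node; .parent is the tag that encloses it.
parents = [text.parent.name for text in any_case]

print(len(lower_only), len(any_case))
print(parents)
```
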