XML: built from tags; each tag has a name and attributes.
<img src="zhangliang.jpg" size="10">...</img>

JSON: built from key-value pairs.
"key" : "value"
"key" : ["value1", "value2"]
"key" : {"subkey" : "subvalue"}

YAML: built from untyped key-value pairs (key: value).
key: value
key:  # Comment
  - value1
  - value2
key:
  subkey: subvalue
name: Python张良

Approach: 1) find all <a> tags; 2) parse each tag's format and extract the link that follows href.
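The two steps above can also be sketched with only the standard library, which makes each step explicit. This is a minimal illustration, not the method used in these notes: it swaps in `html.parser.HTMLParser` for BeautifulSoup, and the names `LinkCollector`, `extract_links`, and `sample` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Step 1: react only to <a> start tags.
        if tag != "a":
            return
        # Step 2: attrs is a list of (name, value) pairs; keep the href value.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html):
    """Return the href of each <a> tag in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input: one <a> with an href and one without.
sample = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" id="link1">'
    'Basic Python</a> and <a id="no-href">x</a></p>'
)
print(extract_links(sample))
```

An `<a>` tag without an `href` attribute is skipped rather than reported as `None`, which is a design choice of this sketch.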
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

find_all() returns a list that stores the search results.
name: a search string matched against tag names.
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b

attrs: a search string matched against tag attribute values; the search can also name a specific attribute.
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

recursive: whether to search all descendants; defaults to True.
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]

string: a search string matched against the text between <>…</>.
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
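
The find_all() parameters shown in these sessions can be gathered into one script. The sketch below is a rough illustration under some assumptions: it inlines a trimmed copy of the demo page so that no network request is needed, and the names `DEMO_HTML` and `summarize` are hypothetical. It also shows two details the transcripts only hint at: recursive=False looks at direct children only, and a regular expression passed as string is case-sensitive unless re.I is given.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the demo page from these notes, inlined to avoid a network call.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">'
    'Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">'
    'Advanced Python</a>.</p>'
    '</body></html>'
)


def summarize(html):
    """Run one find_all() query per parameter and collect the results."""
    soup = BeautifulSoup(html, "html.parser")
    course = soup.find("p", "course")  # the <p class="course"> tag
    return {
        # name: every <a> tag, reduced to its href attribute
        "links": [a.get("href") for a in soup.find_all("a")],
        # attrs: a regular expression matches id values containing "link"
        "ids": [tag["id"] for tag in soup.find_all(id=re.compile("link"))],
        # recursive=False: <a> is not a direct child of the document root
        "top_level_a": soup.find_all("a", recursive=False),
        # ...but both <a> tags are direct children of <p class="course">
        "course_children_a": len(course.find_all("a", recursive=False)),
        # string: the pattern is case-sensitive by default
        "lowercase": soup.find_all(string=re.compile("python")),
        # re.I also matches strings such as "Basic Python"
        "any_case": soup.find_all(string=re.compile("python", re.I)),
    }


result = summarize(DEMO_HTML)
for key, value in result.items():
    print(key, value)
```

The case-sensitive pattern matches only the title and the bold sentence, which is why the link texts "Basic Python" and "Advanced Python" are absent from the last transcript.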