Chinese University Ranking Crawler: Code Example Walkthrough
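Introduction to the "Chinese University Ranking Targeted Crawler" example
Feasibility of targeted crawling
Program structure design
Writing the "Chinese University Ranking Targeted Crawler" example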
Introduction to the "Chinese University Ranking Targeted Crawler" example
URL: http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html
Feasibility of targeted crawling
<tbody class="hidden_zhpm" style="text-align:center;">
  <tr class="alt">
    <td>1</td>
    <td><div align="left">清华大学</div></td>
    <td>北京市</td>
    <td>95.9</td>
    <td class="hidden-xs need-hidden indicator5">100.0</td>
    <td class="hidden-xs need-hidden indicator6" style="display:none;">97.90%</td>
    <td class="hidden-xs need-hidden indicator7" style="display:none;">37342</td>
    <td class="hidden-xs need-hidden indicator8" style="display:none;">1.298</td>
    <td class="hidden-xs need-hidden indicator9" style="display:none;">1177</td>
    <td class="hidden-xs need-hidden indicator10" style="display:none;">109</td>
    <td class="hidden-xs need-hidden indicator11" style="display:none;">1137711</td>
    <td class="hidden-xs need-hidden indicator12" style="display:none;">1187</td>
    <td class="hidden-xs need-hidden indicator13" style="display:none;">593522</td>
  </tr>
  <tr>
    <td>2</td>
    <td><div align="left">北京大学</div></td>
    <td>北京市</td>
    <td>82.6</td>
    <td class="hidden-xs need-hidden indicator5">98.9</td>
    <td class="hidden-xs need-hidden indicator6" style="display:none;">95.96%</td>
    <td class="hidden-xs need-hidden indicator7" style="display:none;">36137</td>
    <td class="hidden-xs need-hidden indicator8" style="display:none;">1.294</td>
    <td class="hidden-xs need-hidden indicator9" style="display:none;">986</td>
    <td class="hidden-xs need-hidden indicator10" style="display:none;">87</td>
    <td class="hidden-xs need-hidden indicator11" style="display:none;">439403</td>
    <td class="hidden-xs need-hidden indicator12" style="display:none;">799</td>
    <td class="hidden-xs need-hidden indicator13" style="display:none;">7343</td>
  </tr>
  <tr class="alt">
    <td>3</td>
    <td><div align="left">浙江大学</div></td>
    <td>浙江省</td>
    <td>80</td>
    <td class="hidden-xs need-hidden indicator5">88.8</td>
    <td class="hidden-xs need-hidden indicator6" style="display:none;">96.46%</td>
    <td class="hidden-xs need-hidden indicator7" style="display:none;">41188</td>
    <td class="hidden-xs need-hidden indicator8" style="display:none;">1.059</td>
    <td class="hidden-xs need-hidden indicator9" style="display:none;">803</td>
    <td class="hidden-xs need-hidden indicator10" style="display:none;">86</td>
    <td class="hidden-xs need-hidden indicator11" style="display:none;">959511</td>
    <td class="hidden-xs need-hidden indicator12" style="display:none;">833</td>
    <td class="hidden-xs need-hidden indicator13" style="display:none;">64392</td>
  </tr>
  <tr>
    <td>4</td>
    ...
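In the excerpt, each <tr> row appears to carry the rank, university name, and total score in its cells 0, 1, and 3. The sketch below checks that reading on a shortened copy of the first row. It uses the standard library's html.parser purely as a stand-in for BeautifulSoup, and RowCollector and SAMPLE are names made up for this illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the cell being read, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened copy of the first row from the excerpt (later indicator cells dropped).
SAMPLE = """
<tbody class="hidden_zhpm" style="text-align:center;">
<tr class="alt"><td>1</td>
<td><div align="left">清华大学</div></td>
<td>北京市</td><td>95.9</td>
<td class="hidden-xs need-hidden indicator5">100.0</td></tr>
</tbody>
"""

parser = RowCollector()
parser.feed(SAMPLE)
parser.close()

first = parser.rows[0]
print(first[0], first[1], first[3])  # rank, name, total score
```

Cell 2 holds the province, which is why the program later skips from index 1 to index 3.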
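Feasibility of targeted crawling: as the excerpt above shows, the ranking data sits directly in the page's HTML. Whether the site permits crawling can be checked through its robots protocol file at http://www.zuihaodaxue.cn/robots.txt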
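The robots check can also be done in code rather than by eye. Here is a minimal sketch using the standard library's urllib.robotparser. The rules in SAMPLE_ROBOTS are invented for illustration and are not the real contents of the site's robots.txt, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

RANKING_URL = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"

print(is_allowed(SAMPLE_ROBOTS, RANKING_URL))  # True
print(is_allowed(SAMPLE_ROBOTS, "http://www.zuihaodaxue.cn/private/data.html"))  # False
```

To check the live site, RobotFileParser can load the file itself: call set_url() with the robots.txt address, then read(), before calling can_fetch().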
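Program structure design
Step 1: Fetch the university ranking page from the web: getHTMLText()
Step 2: Extract the information in the page into a suitable data structure: fillUnivList()
Step 3: Use the data structure to display and output the results: printUnivList()
Writing the "Chinese University Ranking Targeted Crawler" example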
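import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    # Fetch the page and return its text; return "" if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""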
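
import bs4

def fillUnivList(ulist, html):
    # Append [rank, name, total score] for each table row to ulist.
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # the page failed to load or contains no table
        return
    for tr in tbody.children:
        # Skip the whitespace strings between rows; keep only real tags.
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])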
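
def printUnivList(ulist, num):
    # chr(12288) is the full-width space; using it as the fill character
    # keeps the column of Chinese university names aligned.
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # Header: rank, university name, total score.
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows exist
        print(tplt.format(u[0], u[1], u[2], chr(12288)))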

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # print the top 20 universities

if __name__ == "__main__":
    main()
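The `{1:{3}^10}` field in the output template takes its fill character from argument 3, which the program sets to chr(12288). As a small sketch of what that does: chr(12288) is the full-width space (U+3000), which in most monospaced fonts is about as wide as a Chinese character, so padding with it tends to keep columns of Chinese text aligned where ASCII spaces would not. The helper name center_wide is made up for this illustration, and the three rows are copied from the HTML excerpt.

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, the ideographic (full-width) space


def center_wide(text, width=10):
    """Center text in `width` characters, padding with full-width spaces.

    Hypothetical helper, not part of the original program.
    """
    # The nested {2} supplies the fill character and {1} the field width.
    return "{0:{2}^{1}}".format(text, width, FULL_WIDTH_SPACE)


# Rows in the [rank, name, total score] layout that fillUnivList() builds.
rows = [
    ["1", "清华大学", "95.9"],
    ["2", "北京大学", "82.6"],
    ["3", "浙江大学", "80"],
]

for rank, name, score in rows:
    print("{0:^10}\t{1}\t{2:^10}".format(rank, center_wide(name), score))
```

A four-character name centered in a width of 10 gets three full-width spaces on each side, so every name cell occupies the same display width.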