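Web crawlers often need to set request headers. Headers disguise the crawler so that its requests look more like a browser visiting the page (the Selenium framework for Java can achieve the same effect), which lowers the risk of the site blocking the crawler. Jsoup provides two ways to set headers. The first, header(String name, String value), sets one request header per call, so setting several headers means calling it several times. The second, headers(Map<String, String> headers), adds multiple request headers at once from a Map. Program 3-3 sets a single request header. Program 3-4 sets several request headers, taken from a captured network request.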
// Program 3-3
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupConnectHeader {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://searchcustomerexperience.techtarget.com/info/news");
        // Set a single request header
        Connection conheader = connect.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36");
        // Send the GET request and print the parsed document
        Document document = conheader.get();
        System.out.println(document);
    }
}

// Program 3-4
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupConnectHeaderMap {
    public static void main(String[] args) throws IOException {
        Connection connect = Jsoup.connect("https://searchcustomerexperience.techtarget.com/info/news");
        // Set multiple request headers; the headers are stored in a Map
        Map<String, String> header = new HashMap<>();
        header.put("Host", "searchcustomerexperience.techtarget.com");
        header.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36");
        header.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
        header.put("Accept-Language", "zh-cn,zh;q=0.5");
        header.put("Accept-Encoding", "gzip, deflate");
        header.put("Cache-Control", "max-age=0");
        header.put("Connection", "keep-alive");
        // Apply all headers in one call, then send the GET request
        Connection conheader = connect.headers(header);
        Document document = conheader.get();
        System.out.println(document);
    }
}
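Because the headers in Program 3-4 come from a packet capture, typing each one into a put call by hand can be tedious. The following is a minimal sketch, under stated assumptions, of turning a copied header block into a Map. The class CapturedHeaderParser and its parse method are hypothetical helpers, not part of Jsoup, and the sketch assumes the capture tool exports one "Name: value" pair per line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Jsoup: converts a header block copied from
// a packet-capture tool into a Map of header names to values.
final class CapturedHeaderParser {

    static Map<String, String> parse(String rawHeaders) {
        // LinkedHashMap keeps the headers in the order they were captured.
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : rawHeaders.split("\\r?\\n")) {
            // Split on the first colon only, so values such as "host:8080" stay intact.
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skip blank lines, the request line, and lines starting with ':'
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (name.isEmpty()) {
                continue;
            }
            headers.put(name, value);
        }
        return headers;
    }

    public static void main(String[] args) {
        // Assumed capture format: one "Name: value" pair per line.
        String captured = String.join("\n",
                "Host: searchcustomerexperience.techtarget.com",
                "Accept-Language: zh-cn,zh;q=0.5",
                "Accept-Encoding: gzip, deflate",
                "Cache-Control: max-age=0",
                "Connection: keep-alive");
        Map<String, String> headers = parse(captured);
        headers.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With Jsoup on the classpath, the resulting Map could be passed to connect.headers(...) in the same way as the hand-built Map in Program 3-4.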