Scraping Douban's Japanese drama list and saving it as CSV (problems found while scraping, and how to fix them)

    Technology  2023-10-01

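    Firefox's Network Monitor shows that the list data is returned as JSON, so the first step is to fetch it and look at the contents:

    # coding: utf-8
    import json

    import requests

    url = 'https://movie.douban.com/j/search_subjects?type=tv&tag=%E6%97%A5%E5%89%A7&sort=recommend&page_limit=20&page_start=20'
    ut = requests.get(url).text  # raw response body as text
    uj = json.loads(ut)          # parse the body as JSON
    uj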

    This raises an error:

    JSONDecodeError: Expecting value: line 1 column 1 (char 0)

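    The string passed to json.loads is not valid JSON. Inspecting ut shows that it is empty, and the response object is <Response [418]>, so the server rejected the request instead of returning data.

    After adding headers: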
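    The failure above is easier to diagnose if the status code and body are checked before parsing. The following is a minimal sketch, not part of the original script: parse_json_body is a hypothetical helper, and its two arguments stand in for resp.status_code and resp.text of a requests response, so it runs without network access.

```python
import json


def parse_json_body(status_code, text):
    """Parse a response body as JSON, failing early with a readable message.

    status_code and text stand in for resp.status_code and resp.text
    (hypothetical helper, not part of the original script).
    """
    if status_code != 200:
        raise RuntimeError(f"unexpected HTTP status {status_code}")
    if not text.strip():
        raise RuntimeError("empty response body")
    return json.loads(text)


# Simulate the failing case described above: status 418 with an empty body.
try:
    parse_json_body(418, "")
except RuntimeError as exc:
    message = str(exc)
    print(message)

# A well-formed body parses normally.
data = parse_json_body(200, '{"subjects": []}')
print(data)
```

    With a check like this, the 418 response would be reported directly instead of surfacing later as a JSONDecodeError.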

    # coding: utf-8
    import json

    import requests

    headers = {
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive',
        'Host': 'movie.douban.com',
        'Referer': 'https://movie.douban.com/tv/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
    }
    url = 'https://movie.douban.com/j/search_subjects?type=tv&tag=%E6%97%A5%E5%89%A7&sort=recommend&page_limit=20&page_start=20'
    ut = requests.get(url, headers=headers).text
    uj = json.loads(ut)
    uj

    The result is still an error. The response is now <Response [200]>, but ut is garbled (output truncated):

    '�(�\x00�H��bvg�B\\�eM�d\x13t��!&��&��o��\x05\x18X\x1aj`�3�G\rl7����w�x\x07xx...'
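    Garbled text like this is what a compressed body looks like when it is decoded as text without being decompressed first. The sketch below reproduces the effect offline. It uses gzip from the standard library purely as a stand-in (the encoding the server actually chose is not shown in the output above), and the payload is made-up sample data.

```python
import gzip
import json

# Made-up sample payload; the real response has the same top-level "subjects" key.
payload = json.dumps({"subjects": [{"title": "example", "rate": "8.0"}]})
compressed = gzip.compress(payload.encode("utf-8"))

# Decoding compressed bytes as text yields replacement characters,
# similar to the garbled ut shown above.
garbled = compressed.decode("utf-8", errors="replace")

# Parsing the garbled text fails, just like json.loads(ut) did.
try:
    json.loads(garbled)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True

# Decompressing first recovers the original JSON.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
print(parse_failed, restored)
```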

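    Removing Accept-Encoding from headers fixes the problem. A likely cause is that the header advertised br (Brotli), which requests decodes only when a Brotli library is installed, so the body came back still compressed. The complete code follows: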

    # coding: utf-8
    import csv
    import json

    import requests

    headers = {
        'Connection': 'keep-alive',
        'Host': 'movie.douban.com',
        'Referer': 'https://movie.douban.com/tv/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
    }

    # Everything except the page_start value, which is appended per page.
    BASE_URL = ('https://movie.douban.com/j/search_subjects?type=tv&tag=%E6%97%A5%E5%89%A7'
                '&sort=recommend&page_limit=20&page_start=')


    def gf(url):
        """Fetch one page of results and append title, rating, and URL to japan.csv."""
        ut = requests.get(url, headers=headers).text
        uj = json.loads(ut)
        data = uj['subjects']
        # Open the file once per page; an explicit encoding keeps the output
        # the same on every platform.
        with open('japan.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            for d in data:
                writer.writerow([d['title'], d['rate'], d['url']])


    if __name__ == '__main__':
        # 10 pages of 20 entries each: page_start = 0, 20, ..., 180
        for i in range(10):
            gf(BASE_URL + str(i * 20))
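    The paging and row-extraction steps can also be tried without touching the network. This is a sketch under stated assumptions: build_page_url and subjects_to_rows are hypothetical helper names, the sample record is invented, and the output goes to an in-memory buffer rather than japan.csv. It also shows that the tag value %E6%97%A5%E5%89%A7 in the URL is simply the percent-encoded form of 日剧.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/search_subjects"


def build_page_url(page, page_size=20):
    """Build the URL for a zero-based page number (hypothetical helper)."""
    params = {
        "type": "tv",
        "tag": "日剧",  # urlencode percent-encodes this as UTF-8
        "sort": "recommend",
        "page_limit": page_size,
        "page_start": page * page_size,
    }
    return BASE_URL + "?" + urlencode(params)


def subjects_to_rows(payload):
    """Turn a parsed response into [title, rate, url] rows (hypothetical helper)."""
    return [[s["title"], s["rate"], s["url"]] for s in payload["subjects"]]


# Invented sample record with the three keys the script reads.
sample = {
    "subjects": [
        {"title": "Example Drama", "rate": "8.5", "url": "https://example.com/1"}
    ]
}

page_url = build_page_url(1)
buffer = io.StringIO()
csv.writer(buffer).writerows(subjects_to_rows(sample))

print(page_url)
print(buffer.getvalue())
```

    Building the query string with urlencode avoids hand-writing the encoded tag, and keeping the row conversion in its own function makes it easy to check before running the real requests.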

     

     
