Scraping movies with selenium + XPath: problems encountered and source code


钢盔兔 2023-06-28 12:00:04


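I haven't posted in a long time because I've been preparing for the postgraduate entrance exam. I now have to present a data-collection assignment (I had planned to just get it over with, but because I kept putting off presenting, other people had already covered most of what I had done, so I had no choice but to scrap it and start over. The lesson: take the initiative), so here is a quick update: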

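This time the goal is to scrape the movie information from douban's Top 250 list. The page rules are very simple: the URL pattern is easy to spot (only the start parameter changes from page to page) and nothing is encrypted. There are also no anti-scraping measures or ajax loading to worry about. I'll paste the code directly and then point out a few problems I ran into. (So this time it's rather like using a cannon to swat a fly.)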
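As a rough illustration of that URL pattern, the list pages could be generated as below. This is only a sketch: `page_urls` is a hypothetical helper name, and the step of 25 assumes each page lists 25 movies, as the script in this post does.

```python
# Hypothetical helper: build the list-page URLs by varying only `start`.
BASE_URL = "https://movie.douban.com/top250?start={}"


def page_urls(total=250, per_page=25):
    """Return one URL per list page (assumes `per_page` movies per page)."""
    return [BASE_URL.format(start) for start in range(0, total, per_page)]


urls = page_urls()
print(len(urls))   # number of list pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

The script in this post does not build the URLs this way; it clicks the "next" button instead, which reaches the same pages.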

Code:

import csv
import os
import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.service import Service



def windows_child(url):
    # Open the detail page in a child window (clicking the link would give the same result).
    # Relies on the module-level `s` and `chrome_options` created in the main block.
    child = webdriver.Chrome(service=s, options=chrome_options)
    child.get(url)                          # load the movie's detail page
    text = child.page_source                # get the page source
    tree = etree.HTML(text)                 # build the etree object
    # Rank
    position = ''.join(tree.xpath('/html/body/div[3]/div[1]/div[1]/span[1]/text()'))
    # Movie title
    name = ''.join(tree.xpath('/html/body/div[3]/div[1]/h1/span[1]/text()'))
    # Rating
    rating = ''.join(tree.xpath('/html/body/div[3]/div[1]/div[3]/div[1]/div[1]/div[1]/div[2]/div/div[2]/strong/text()'))
    # Synopsis: strip leading/trailing whitespace and drop the newlines in between
    brief_info = ''.join(tree.xpath('/html/body/div[3]/div[1]/div[3]/div[1]/div[3]/div/span[1]//text()')).strip().replace('\n', '')

    # Close the child page. close() is used instead of quit() because quit()
    # closes every associated window, while close() only closes the current one.
    child.close()

    return position, name, rating, brief_info


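# Parse one list page
def page_parse(html):
    # Note: `html` is not used. The XPath queries run against the module-level
    # `tree` that the main loop builds from the current page source.
    for i in range(1, 26):
        # Link to the movie's detail page
        href = ''.join(tree.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li[{}]/div/div[2]/div[1]/a/@href".format(i)))
        # Follow the link and scrape the detail page for more information
        position, name, rating, brief_info = windows_child(href)

        text = '\t'.join([position, name, rating, brief_info, href])
        yield text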

if __name__ == "__main__":
    print('**************Start scraping douban movies**************')

    # Avoid certificate errors
    chrome_options = webdriver.ChromeOptions()
    # Ignore certificate errors
    chrome_options.add_argument('--ignore-certificate-errors')
    # Hide the "Bluetooth: bluetooth_adapter_winrt.cc:1075 Getting Default Adapter failed."
    # error and the "DevTools listening on ws://127.0.0.1..." notice. Both switches go
    # in one list because a second call with the same option name replaces the first.
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])
    # Headless mode (uncomment to run without a visible browser window)
    # chrome_options.add_argument('--headless')
    # chrome_options.add_argument('--disable-gpu')

    s = Service(executable_path='./chromedriver')
    window = webdriver.Chrome(service=s, options=chrome_options)    # main browser window

    window.get('https://movie.douban.com/top250?start=0')   # load the first page
    filename = "douban_movie.csv"                            # name of the output file
    if os.path.exists(filename):
        os.remove(filename)                                  # start from a clean file (for debugging)

    for start in range(0, 250, 25):             # 10 pages of 25 movies each
        # time.sleep(2)     # would give ajax content time to load; not needed here, the page uses no ajax
        text = window.page_source               # get the page source
        tree = etree.HTML(text)                 # build the etree object used by page_parse
        html_url = window.current_url           # URL of the current page
        print(html_url)

        with open(filename, 'a+', newline='', encoding='utf-8') as res_file:    # append mode
            writer = csv.writer(res_file)
            if start == 0:
                writer.writerow(['Rank', 'Title', 'Rating', 'Synopsis', 'Link'])   # header row
            page_generator = page_parse(html=html_url)
            for text in page_generator:
                writer.writerow(text.split('\t'))
                print(text)

        if start < 225:                                             # the last page has no next page
            next_page = window.find_element('class name', 'next')   # locate the "next page" button
            next_page.click()

    print('**************Scraping finished**************')

    # time.sleep(2)
    window.quit()   # close the browser
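
One fragile spot in the script is that each record is joined with tabs and then split again before it is written, so a tab inside a synopsis would shift the columns. A possible alternative, shown here as a minimal sketch rather than the author's method, is to yield each record as a list and hand it straight to csv.writer. The sample record and the names `fake_rows` and `write_rows` are made up for illustration.

```python
import csv
import io


def fake_rows():
    """Stand-in for page_parse: yields one list per movie (sample data only)."""
    yield ["No.1", "Example Movie", "9.0",
           "A synopsis with a\ttab and, a comma", "https://example.com/1"]


def write_rows(out, rows, header):
    """Write the header and every row; csv.writer quotes fields when needed."""
    writer = csv.writer(out)
    writer.writerow(header)
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()   # in-memory file so the sketch runs without touching disk
write_rows(buffer, fake_rows(), ["rank", "title", "rating", "synopsis", "link"])

# Reading the data back gives the same five columns, tab and comma included.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1])
```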

Problems encountered:

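1: UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 130: illegal multibyte sequence

There are many explanations for this online, but I solved it by passing encoding='utf-8' to open():

with open(filename,'a+',newline='',encoding='utf-8') as res_file:

In other words, specifying the encoding when writing the file is enough. Without it, open() falls back to the platform's default encoding (gbk here, judging by the error message), which cannot represent the character.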
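
To see why the explicit encoding matters, here is a small sketch. It assumes the failure came from the platform default encoding being gbk, which the error message suggests; the file name `demo.csv` and the sample text are made up.

```python
import csv
import os
import tempfile

bullet_text = "Drama \u2022 1994"   # contains U+2022, the character from the error

# gbk cannot represent U+2022, which reproduces the failure without any file I/O.
try:
    bullet_text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
print("encodable in gbk:", gbk_ok)

# Writing with an explicit utf-8 encoding works whatever the platform default is.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "a+", newline="", encoding="utf-8") as res_file:
    csv.writer(res_file).writerow(["1", bullet_text])

# Read the file back with the same encoding to confirm the character survived.
with open(path, newline="", encoding="utf-8") as res_file:
    rows = list(csv.reader(res_file))
print(rows)
```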

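2: Element lookup fails when the locator strategy is given as 'class'

This one was simply a mistake in my code: when locating an element, the strategy name is 'class name', not just 'class'.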
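
For reference, the strategy strings that selenium's `By` class defines are listed in the sketch below, reproduced as plain strings so the check runs without a browser. `check_locator` is a hypothetical helper for illustration and not part of selenium; in real code, importing `By` from `selenium.webdriver.common.by` and passing `By.CLASS_NAME` avoids typing the string by hand.

```python
# Strategy strings as defined by selenium.webdriver.common.by.By
# (copied here so the example needs no selenium install).
LOCATOR_STRATEGIES = {
    "id", "xpath", "link text", "partial link text",
    "name", "tag name", "class name", "css selector",
}


def check_locator(strategy):
    """Hypothetical helper: True if `strategy` is a known strategy string."""
    return strategy in LOCATOR_STRATEGIES


print(check_locator("class"))        # the spelling that caused the problem
print(check_locator("class name"))   # the spelling used in the final script
```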

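3: 15244:13796:0426/003507.148:ERROR:ssl_client_socket_impl.cc(992)] handshake failed; returned -1, SSL error code 1, net_error -100, and similar warnings

None of these messages stop your code from running; they just clutter the output. You can silence them by adding a few options to your webdriver, for example: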

    # Avoid certificate errors
    chrome_options = webdriver.ChromeOptions()
    # Ignore certificate errors
    chrome_options.add_argument('--ignore-certificate-errors')
    # Hide the "Bluetooth: bluetooth_adapter_winrt.cc:1075 Getting Default Adapter failed."
    # error and the "DevTools listening on ws://127.0.0.1..." notice. Both switches go
    # in one list because a second call with the same option name replaces the first.
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])

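4: selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only s... Current browser version is 112.0.5615.138 with binary path C:\Users\csl\AppData\Local\Google\Chrome\Application\chrome.exe

Roughly, this means the driver you downloaded does not match the version of Chrome you have installed. Download the driver that corresponds to the Chrome version on your machine; it has to match your installed browser, so simply taking the newest driver won't work. If there is no exact match, pick the closest version. The drivers can be downloaded from the CNPM Binaries Mirror.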
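
The matching rule described above (the exact version if available, otherwise the closest one) can be sketched as follows. `closest_driver` is a hypothetical helper, the list of available driver versions is sample data, and "closest" is interpreted here as same major version first, then the smallest difference in the remaining numbers; that interpretation is an assumption.

```python
def parse_version(version):
    """Turn '112.0.5615.138' into (112, 0, 5615, 138)."""
    return tuple(int(part) for part in version.split("."))


def closest_driver(browser_version, available):
    """Pick the available driver version nearest to the browser version.

    Candidates with the same major version are preferred; among those,
    the one whose remaining components differ least wins.
    """
    target = parse_version(browser_version)
    same_major = [v for v in available if parse_version(v)[0] == target[0]]
    candidates = same_major or available

    def distance(v):
        # Component-wise absolute difference, compared left to right.
        return tuple(abs(a - b) for a, b in zip(parse_version(v), target))

    return min(candidates, key=distance)


# Sample data only; a real list would come from the download site.
available = ["111.0.5563.64", "112.0.5615.28", "112.0.5615.49", "113.0.5672.24"]
print(closest_driver("112.0.5615.138", available))
```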

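5: How do you close just one browser window without closing the main window?

This comes down to the difference between .close() and .quit(): the former closes only the window the driver is currently operating on, while the latter closes all windows, which amounts to shutting the browser down.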

There isn't much else to explain, since the task is fairly simple and the comments in the code are already quite detailed.
