Debugging help: (504) error when crawling with Scrapy-Splash, Lua script wait loop times out

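I'm new to programming and am trying to build a web scraper. I use a Lua script so that my Scrapy request waits for a web element to appear once the site's JavaScript has finished loading. I don't care which element it is; I just need the initial page loader to finish so that I can access the HTML elements. The specific site I want to access is https://www.ladbrokes.com.au/sports/basketball/usa/nba, which shows a JS initial loader page before any of the site's elements load.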

My current code is as follows:

import scrapy
from scrapy_splash import SplashRequest


class Ladbrokes(scrapy.Spider):
    name = 'Ladbrokes'
    allowed_domains = ['ladbrokes.com.au']
    start_urls = ['https://www.ladbrokes.com.au/sports']

    def parse(self, response):
        # select_ladbrokes is defined elsewhere (not shown)
        sports_link = select_ladbrokes(response)

        for link in sports_link:
            url = response.urljoin(link)
            # Render each sport page through Splash with the Lua script below
            yield SplashRequest(url=url, callback=self.ladbrokes_all_comps,
                                endpoint='execute',
                                args={'lua_source': lua_script})

    def ladbrokes_all_comps(self, response):
        comps = response.xpath('//*[@id="accordion_4e099d27-0f11-4c6e-848e-965fff7ad995"]/div[2]/div[2]/div[1]/div[2]/div[1]/div/div[1]/text()').extract()

lua_script = '''
function main(splash)
    assert(splash:go(splash.args.url))
    -- Poll until the target element exists (no upper bound on iterations)
    while not splash:select('#page-content-left > div > div') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
'''

When I run my spider, I end up with these errors:

2019-11-25 16:41:30 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://www.ladbrokes.com.au/sports/nrl via http://0.0.0.0:8050/execute> (referer: None)
2019-11-25 16:41:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://www.ladbrokes.com.au/sports/nrl>: HTTP status code is not handled or not allowed

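It looks like the while loop in the Lua script is timing out, but I'm not sure whether that is because I picked the wrong web element or because of something else.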
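One way to tell those two cases apart might be to bound the loop, so the script always returns before Splash's own timeout (30 seconds by default) and reports whether the element ever appeared. The following is only a sketch under assumptions: `build_wait_script`, `max_wait`, `poll_interval`, and the `found` key are hypothetical names introduced here, not part of Splash or scrapy-splash, and the elapsed time is approximated by summing the sleep intervals.

```python
# Sketch only: build Lua source for Splash's `execute` endpoint that polls for
# a CSS selector but gives up after a fixed budget instead of looping forever.
# build_wait_script, max_wait and poll_interval are hypothetical names.

def build_wait_script(selector: str, max_wait: float = 20.0,
                      poll_interval: float = 0.1) -> str:
    """Return Lua source that waits at most roughly max_wait seconds for selector."""
    # Escape backslashes and single quotes so the selector is a valid
    # single-quoted Lua string literal.
    lua_selector = selector.replace("\\", "\\\\").replace("'", "\\'")
    return f"""
function main(splash)
  assert(splash:go(splash.args.url))
  local waited = 0
  local found = false
  while waited < {max_wait} do
    if splash:select('{lua_selector}') then
      found = true
      break
    end
    splash:wait({poll_interval})
    waited = waited + {poll_interval}
  end
  -- Return the HTML either way, plus a flag saying whether the element appeared.
  return {{html = splash:html(), found = found}}
end
"""


lua_script = build_wait_script('#page-content-left > div > div')
```

With a script like this, the callback could inspect `response.data['found']` (scrapy-splash exposes the decoded JSON result as `response.data`). If it is false, the selector probably never matches; if the request still ends in a 504, the page load itself is the more likely culprit.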

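I also tried setting a long 'wait' argument in the SplashRequest call, but the initial page loader never seems to finish loading. Any help with this would be much appreciated!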
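A possibly relevant detail: as far as I understand, with the `execute` endpoint a `wait` argument is not applied automatically; it is only passed through to the script as `splash.args.wait`, whereas `timeout` is the overall budget Splash enforces before answering with a 504. The sketch below illustrates that split. `build_splash_args` and `LUA_WITH_WAIT` are hypothetical names, and the 60 and 2 second values are arbitrary assumptions (by default Splash caps `timeout` at 90 seconds).

```python
# Sketch only: argument dict for SplashRequest(..., endpoint='execute', args=...).
# build_splash_args and LUA_WITH_WAIT are hypothetical names for illustration.

LUA_WITH_WAIT = """
function main(splash)
  assert(splash:go(splash.args.url))
  -- Nothing waits automatically here, so the script sleeps explicitly
  -- for the number of seconds passed in as the 'wait' argument.
  splash:wait(splash.args.wait)
  return {html = splash:html()}
end
"""


def build_splash_args(lua_source: str, timeout: float = 60.0,
                      wait: float = 2.0) -> dict:
    """Return the args dict to pass to a SplashRequest for the execute endpoint."""
    return {
        "lua_source": lua_source,
        "timeout": timeout,  # overall budget enforced by Splash for the script
        "wait": wait,        # only visible to the script as splash.args.wait
    }


splash_args = build_splash_args(LUA_WITH_WAIT)
```

The resulting dict would be passed as `args=splash_args` in place of `args={'lua_source': lua_script}` in the spider above.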
