Dynamic 'wait' parameter in Scrapy-Splash

2020-8-6 8:56:48

I am using Scrapy-Splash to scrape multiple pages.

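import scrapy


class Spider(scrapy.Spider):
    name = "scrape"

    def start_requests(self):
        urls = get_urls()  # helper defined elsewhere; returns the URLs to crawl
        for url in urls:
            # Render each page through Splash with a fixed 8-second wait
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 8}
                }
            })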

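The code works, and I get the results I want from the pages.

The problem is that I have to set a long wait time (>4 seconds); otherwise Splash is sometimes terminated by the next request before it returns a result. This seems very unreliable.

Is there a way to make the wait time more dynamic? I found a partial solution here, using a Lua script: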

Adding a wait-for-element while performing a SplashRequest in python Scrapy

function main(splash)
  splash:set_user_agent(splash.args.ua)
  assert(splash:go(splash.args.url))

  -- requires Splash 2.3
  while not splash:select('.my-element') do
    splash:wait(0.1)
  end
  return {html=splash:html()}
end

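But it seems to need a hard-coded element (".my-element") to tell Splash when to stop, and I am scraping many different websites and collecting different elements.

How should I set the 'wait' parameter dynamically, or customize the Lua script, so that Splash finishes as soon as the required element has loaded? Surely this is a common problem?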
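
One possible direction, sketched here rather than confirmed as a solution: instead of hard-coding the selector in the Lua script, pass it as a Splash argument per request and poll for it with an upper time limit. The names `WAIT_FOR_SELECTOR_LUA`, `build_splash_meta`, `css`, and `max_wait` are hypothetical and do not come from the question; the sketch also assumes Splash 2.3 or later (for `splash:select`) and that `max_wait` stays below Splash's own request timeout.

```python
# Lua script for Splash's `execute` endpoint. The selector and the time
# budget arrive as request arguments (`args.css`, `args.max_wait`), so the
# same script can be reused for every site. Both argument names are
# assumptions made for this sketch.
WAIT_FOR_SELECTOR_LUA = """
function main(splash, args)
  assert(splash:go(args.url))

  -- Poll until the selector matches or the time budget is used up.
  local waited = 0
  while not splash:select(args.css) and waited < args.max_wait do
    splash:wait(0.1)
    waited = waited + 0.1
  end
  return {html = splash:html()}
end
"""


def build_splash_meta(css, max_wait=10.0):
    """Build the `meta` dict for a scrapy.Request rendered through Splash.

    css      -- CSS selector of the element this particular page must contain
    max_wait -- upper bound, in seconds, on how long the script polls
    """
    return {
        'splash': {
            'endpoint': 'execute',
            'args': {
                'lua_source': WAIT_FOR_SELECTOR_LUA,
                'css': css,
                'max_wait': max_wait,
            },
        }
    }
```

In the spider above, the request would then be built as `scrapy.Request(url, self.parse, meta=build_splash_meta('.my-element'))`, with a different selector for each site. Because the script returns a table, the rendered HTML comes back under the `html` key of the JSON result. The `max_wait` cap keeps the loop from spinning forever on pages where the selector never appears.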

Author: user2896718
