How do I get session_id from the Crawlera Lua script when using Scrapy Splash?

2018-11-27 15:13:0

As you know, when using Scrapy Splash together with Crawlera, we use this Lua script:

function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- See the file spiders/quotes-js.py for how to do this.
    -- Find your Crawlera credentials at https://app.scrapinghub.com/
    local user = splash.args.crawlera_user

    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        -- Compare the value itself: type() returns a string, so it is never nil.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
    })
    assert(splash:wait(3))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end

That Lua script has a session_id variable that I really need, but how can I access it from the response in Scrapy?

I have tried response.session_id and response.headers['X-Crawlera-Session'], but neither works.

User 939364

Use splash:set_result_header to copy the session ID into the headers of the response that Splash sends back.
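The suggestion to use splash:set_result_header could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested setup: the Lua source is held in a Python string (the hypothetical LUA_SOURCE) so it can be passed as the lua_source argument, session_id is moved to file level so main can read it, and session_id_from_headers is a hypothetical helper that assumes the header set by Splash is visible on the response Scrapy receives.

```python
# Hypothetical sketch: Lua script kept in a Python string for the `lua_source` arg.
LUA_SOURCE = """
local session_header = 'X-Crawlera-Session'
local session_id = 'create'  -- file-level so main() can read it

function use_crawlera(splash)
    local user = splash.args.crawlera_user

    splash:on_request(function (request)
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{'proxy.crawlera.com', 8010, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    assert(splash:go(splash.args.url))
    -- Copy the session ID into the headers of the HTTP response Splash returns.
    splash:set_result_header(session_header, session_id)
    return {html = splash:html()}
end
"""


def session_id_from_headers(headers, name="X-Crawlera-Session"):
    """Return the session ID from a headers mapping, or None if it is absent.

    Scrapy's response.headers.get() returns bytes, so bytes are decoded.
    """
    value = headers.get(name)
    if value is None:
        return None
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value
```

In a spider callback this would be called as session_id_from_headers(response.headers). If no session has been created yet, the header carries the initial value 'create'.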

2019-07-05 13:41:12

User 3923463

In your Lua script, also return the HAR data (https://splash.readthedocs.io/en/stable/scripting-ref.html#splash-har):

    return {
        html = splash:html(),
        har = splash:har(),
        cookies = splash:get_cookies(),
    }

Assuming you are using scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash), make sure your request is sent to the execute endpoint: meta['splash']['endpoint'] = 'execute'. If you use scrapy.Request, render.json is the default endpoint, whereas for scrapy_splash.SplashRequest the default endpoint is render.html. See these two examples for how to set the endpoint: https://github.com/scrapy-plugins/scrapy-splash#requests
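As a minimal sketch of the endpoint setting described above, the meta dictionary for a plain scrapy.Request might be built like this. The helper name build_splash_meta and its parameters are hypothetical; only the 'splash', 'endpoint', and 'args' keys and the lua_source and crawlera_user arguments come from the thread and the scrapy-splash README.

```python
def build_splash_meta(lua_source, crawlera_user):
    """Build a `meta` dict that routes a scrapy.Request to Splash's execute endpoint."""
    return {
        "splash": {
            # Without this, scrapy.Request defaults to render.json,
            # which does not run the Lua script.
            "endpoint": "execute",
            "args": {
                "lua_source": lua_source,        # the Lua script as a string
                "crawlera_user": crawlera_user,  # read as splash.args.crawlera_user
            },
        }
    }
```

It would be passed as scrapy.Request(url, self.parse, meta=build_splash_meta(script, api_key)). With scrapy_splash.SplashRequest, the equivalent is to pass endpoint='execute' and the same args dictionary directly.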
Only then can you access the X-Crawlera-Session header in your parse method:

    import json

    def parse(self, response):
        # The execute endpoint returns the Lua result table as JSON.
        headers = json.loads(response.text)['har']['log']['entries'][0]['response']['headers']
        session_id = next(x for x in headers if x['name'] == 'X-Crawlera-Session')['value']

>>> headers = json.loads(response.text)['har']['log']['entries'][0]['response']['headers']
>>> next(x for x in headers if x['name'] == 'X-Crawlera-Session')
{u'name': u'X-Crawlera-Session', u'value': u'2124641382'}
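The lookup above reads only the first HAR entry and raises StopIteration when the header is missing. A somewhat more defensive variant, offered as a sketch rather than as the answerer's method, scans every entry and matches the header name case-insensitively. The function names find_session_id and session_id_from_body are hypothetical; the HAR layout (log, entries, response, headers, name, value) is the one shown in the session above.

```python
import json


def find_session_id(har, header_name="X-Crawlera-Session"):
    """Return the first value of header_name found in any HAR response, else None."""
    wanted = header_name.lower()
    for entry in har.get("log", {}).get("entries", []):
        for header in entry.get("response", {}).get("headers", []):
            if header.get("name", "").lower() == wanted:
                return header.get("value")
    return None


def session_id_from_body(text):
    """Parse the JSON body returned by the execute endpoint and look up the session ID."""
    return find_session_id(json.loads(text).get("har", {}))
```

Inside parse this would be called as session_id_from_body(response.text), with a None result meaning that no response in the HAR log carried the header.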

2019-07-06 16:59:29

Author:

User 1564659
