Iterate through a page of voice samples, download each sample, and put the lines in a text file

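Here is the page I'm trying to do this with. These are the voice lines of GLaDOS from Portal. Each line is the inner text of an "i" HTML element, which is also the text shown between quotation marks on the page. Each one has a direct download link next to it, labeled "download". I'm trying to get the voice lines into one of the two formats accepted by the MARY TTS speech synthesizer. Either each line gets its own text file, with a file name matching the name of the wav file, or all lines are formatted in a single text file ( filename " insert line here").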

I tried to do this myself, but after 4 hours all I have is a small piece of Python code that doesn't work.

from bs4 import BeautifulSoup
import re
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('i')
f = open('Lines.txt', 'w')
for t in range(len(tags)):
    f.write(tags[t] + '\n')

f.close()

It returns "TypeError: unsupported operand type(s) for +: 'Tag' and 'str'."

I also tried AutoHotKey.

^g::

IEGet(Name="")        ;Retrieve pointer to existing IE window/tab
{
    IfEqual, Name,, WinGetTitle, Name, ahk_class IEFrame
        Name := ( Name="New Tab - Windows Internet Explorer" ) ? "about:Tabs"
        : RegExReplace( Name, " - (Windows|Microsoft) Internet Explorer" )
    For wb in ComObjCreate( "Shell.Application" ).Windows
        If ( wb.LocationName = Name ) && InStr( wb.FullName, "iexplore.exe" )
            Return wb
} ;written by Jethrow

wb := IEGet()

IELoad(wb)    ;You need to send the IE handle to the function unless you define it as global.
{
    If !wb    ;If wb is not a valid pointer then quit
        Return False
    Loop    ;Otherwise sleep for .1 seconds untill the page starts loading
        Sleep,100
    Until (wb.busy)
    Loop    ;Once it starts loading wait until completes
        Sleep,100
    Until (!wb.busy)
    Loop    ;optional check to wait for the page to completely load
        Sleep,100
    Until (wb.Document.Readystate = "Complete")
Return True
}

For IE in ComObjCreate("Shell.Application").Windows ; for each open window
If InStr(IE.FullName, "iexplore.exe") ; check if it's an ie window
break ; keep that window's handle
; this assumes an ie window is available. it won't work if not

IE.Navigate("http://theportalwiki.com/wiki/GLaDOS_voice_lines")
While IE.Busy
    Sleep, 100
Links := IE.Document.Links

Inner := FileOpen("C:\Users\Johnson\Desktop\GLaDOS Voice", "w")
Rows := IE.Document.All.Tags("table")[4].Rows
    Loop % Rows.Length
        Inner.Write(Row[A_Index].InnerText . "`r`n")

Inner.Close()
Return

As far as I can tell, the AutoHotKey script does nothing. I press the hotkey and nothing happens.

I'd prefer Lua, since it's consistent and I understand it.

user94559

Your Python code is very close. Here is a minor fix (it also uses a context manager for the file):

from bs4 import BeautifulSoup
import urllib.request

soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('i')

with open('Lines.txt', 'w') as f:
    for tag in tags:
        # Each tag is a bs4 Tag, not a str; .text returns its string content.
        f.write(tag.text.strip('“”') + '\n')

Lines.txt:

You just have to look at things objectively, see what you don't need anymore, and trim out the fat.
Portal
Portal 2

Welcome to the Aperture Science Computer-Aided Enrichment Center.
...
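
Note that the output above also contains italic text that is not a voice line ("Portal", "Portal 2"). Since the question describes each voice line as the text shown between quotation marks, one possible way to filter is to keep only the strings that are wrapped in quotes. This is a minimal sketch, not a tested scraper for that page: the helper names `extract_quote` and `keep_voice_lines` are made up for illustration, and it assumes every voice line is wrapped in curly or straight double quotes.

```python
import re

# Assumption: a voice line is italic text wrapped in curly (“ ”) or straight (") quotes.
QUOTED = re.compile(r'^[“"](.+)[”"]$', re.DOTALL)


def extract_quote(text):
    """Return the text between the surrounding quotes, or None if the text is unquoted."""
    match = QUOTED.match(text.strip())
    return match.group(1).strip() if match else None


def keep_voice_lines(texts):
    """Keep only the quoted entries from an iterable of <i> texts, with quotes removed."""
    quotes = (extract_quote(t) for t in texts)
    return [q for q in quotes if q]
```

With the `tags` list from the code above, this could be called as `keep_voice_lines(tag.text for tag in tags)`.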

EDIT

To answer the question in the comments below, this should get the download links:

from bs4 import BeautifulSoup
import urllib.request

soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('a')

with open('Downloads.txt', 'w') as f:
    for tag in tags:
        if tag.text == 'Download':
            f.write(tag['href'] + '\n')

Downloads.txt:

http://i1.theportalwiki.net/img/e/e5/GLaDOS_00_part1_entry-1.wav
http://i1.theportalwiki.net/img/d/d7/GLaDOS_00_part1_entry-2.wav
http://i1.theportalwiki.net/img/5/50/GLaDOS_00_part1_entry-3.wav
...
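
To get from these links to what the question asked for (the wav files on disk plus a single transcript file), something like the following might work. Treat it as a sketch under assumptions: `wav_name`, `format_entry`, `download_all`, and `write_transcript` are hypothetical helpers, the caller is responsible for pairing each URL with the right line of text, the row layout copies the `( filename " insert line here")` pattern from the question, and dropping the `.wav` extension in the transcript is a guess that should be checked against the MARY TTS documentation.

```python
import os
import posixpath
import urllib.request
from urllib.parse import unquote, urlparse


def wav_name(url):
    """Return the file name at the end of a download URL."""
    return posixpath.basename(unquote(urlparse(url).path))


def format_entry(url, line):
    """Build one transcript row: ( name-without-extension "line" )."""
    # Assumption: the transcript wants the name without ".wav".
    stem = posixpath.splitext(wav_name(url))[0]
    return f'( {stem} "{line}" )'


def download_all(urls, dest_dir):
    """Download every URL into dest_dir and return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(dest_dir, wav_name(url))
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved


def write_transcript(pairs, path):
    """Write one row per (url, line) pair to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as f:
        for url, line in pairs:
            f.write(format_entry(url, line) + '\n')
```

For example, `format_entry("http://i1.theportalwiki.net/img/e/e5/GLaDOS_00_part1_entry-1.wav", "Hi")` gives `( GLaDOS_00_part1_entry-1 "Hi" )`. Downloading sequentially keeps the load on the wiki low; adding a short pause between requests would be more polite still.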
2016-07-21 04:12:48