What an Endless Conversation with Werner Herzog Can Teach Us about AI

2023-01-18 15:04:36

On the website Infinite Conversation, the German filmmaker Werner Herzog and the Slovenian philosopher Slavoj Žižek are having a public chat about anything and everything. Their discussion is compelling, in part, because these intellectuals have distinctive accents when speaking English, not to mention a tendency toward eccentric word choices. But they have something else in common: both voices are deepfakes, and the text they speak in those distinctive accents is being generated by artificial intelligence.

I built this conversation as a warning. Improvements in what’s called machine learning have made deepfakes—incredibly realistic but fake images, videos or speech—too easy to create, and their quality too good. At the same time, language-generating AI can quickly and inexpensively churn out large quantities of text. Together, these technologies can do more than stage an infinite conversation. They have the capacity to drown us in an ocean of disinformation.

Machine learning, an AI technique that uses large quantities of data to “train” an algorithm to improve as it repetitively performs a particular task, is going through a phase of rapid growth. This is pushing entire sectors of information technology to new levels, including speech synthesis: systems that produce utterances that humans can understand. As someone who is interested in the liminal space between humans and machines, I’ve always found it a fascinating application. So when those advances in machine learning allowed voice synthesis and voice cloning technology to improve in giant leaps over the past few years, after a long history of small, incremental improvements, I took note.

Infinite Conversation got started when I stumbled across an exemplary speech synthesis program called Coqui TTS. Many projects in the digital domain begin with finding a previously unknown software library or open-source program. When I discovered this tool kit, accompanied by a flourishing community of users and plenty of documentation, I knew I had all the necessary ingredients to clone a famous voice.

As an appreciator of Werner Herzog’s work, persona and worldview, I’ve always been drawn to his voice and way of speaking. I’m hardly alone, as pop culture has made Herzog into a literal cartoon: his cameos and collaborations include The Simpsons, Rick and Morty and Penguins of Madagascar. So when it came to picking someone’s voice to tinker with, there was no better option, particularly since I knew I would have to listen to that voice for hours on end. It’s almost impossible to get tired of hearing his dry speech and heavy German accent, which convey a gravitas that can’t be ignored.

Building a training set for cloning Herzog’s voice was the easiest part of the process. Between his interviews, voice-overs and audiobook work there are literally hundreds of hours of speech that can be harvested for training a machine-learning model—or in my case, fine-tuning an existing one. A machine-learning algorithm’s output generally improves in “epochs,” which are cycles through which the neural network is trained with all the training data. The algorithm can then sample the results at the end of each epoch, giving the researcher material to review in order to evaluate how well the program is progressing. With the synthetic voice of Werner Herzog, hearing the model improve with each epoch felt like witnessing a metaphorical birth, with his voice gradually coming to life in the digital realm.
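To make the idea of an epoch concrete, here is a minimal sketch in Python. It is a toy, not the voice-cloning setup described above and not Coqui TTS code: the one-parameter model, the made-up data, the learning rate and the function name `train` are all assumptions chosen for illustration. Each epoch is one full pass over the training data, and at the end of every epoch the loop records a loss value and a sample output that a person could review.

```python
def train(data, epochs=5, lr=0.05):
    """Fit y = w * x by gradient descent; one epoch = one full pass over data."""
    w = 0.0
    history = []
    for epoch in range(1, epochs + 1):
        for x, y in data:
            error = w * x - y
            w -= lr * error * x  # gradient of 0.5 * error**2 with respect to w
        # End of epoch: measure progress and keep a sample for review.
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        sample = w * 10.0  # the model's current guess for the input x = 10
        history.append((epoch, loss, sample))
    return w, history


# Hypothetical training set that follows y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, history = train(data)
for epoch, loss, sample in history:
    print(f"epoch {epoch}: loss={loss:.5f} sample={sample:.3f}")
```

The printed loss shrinks from one epoch to the next, which is the numerical counterpart of hearing a synthetic voice get better with each pass.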

Once I had a satisfactory Herzog voice, I started working on a second voice and intuitively picked Slavoj Žižek. Like Herzog, Žižek has an interesting, quirky accent, a relevant presence within the intellectual sphere and connections with the world of cinema. He has also achieved somewhat popular stardom, in part thanks to his polemical fervor and sometimes controversial ideas.

At this point, I still wasn’t sure what the final format of my project was going to be—but having been taken by surprise by how easy and smooth the whole process of voice-cloning was, I knew it was a warning to anyone who would pay attention. Deepfakes have become too good and too easy to make; just this month, Microsoft announced a new speech synthesis tool called VALL-E that, researchers claim, can imitate any voice based on just three seconds of recorded audio. We’re about to face a crisis of trust, and we’re utterly unprepared for it.

In order to emphasize this technology’s capacity to produce large quantities of disinformation, I settled on the idea of a never-ending conversation. I only needed a large language model—fine-tuned on texts written by each of the two participants—and a simple program to control the back-and-forth of the conversation, so that its flow would feel natural and believable.
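The back-and-forth control can be sketched in a few lines of Python. This is a hedged illustration and not the program behind Infinite Conversation: the `converse` generator and the `stub_speaker` helper are hypothetical names, and the stub simply echoes part of its prompt where a real system would call a fine-tuned language model and then a speech synthesizer.

```python
from itertools import cycle, islice
from typing import Callable, Dict, Iterator, Tuple

ReplyFn = Callable[[str], str]


def converse(speakers: Dict[str, ReplyFn], opener: str) -> Iterator[Tuple[str, str]]:
    """Yield (speaker, utterance) pairs forever, alternating between speakers."""
    last = opener
    for name in cycle(speakers):
        # Each speaker answers whatever was said most recently.
        last = speakers[name](last)
        yield name, last


def stub_speaker(tag: str) -> ReplyFn:
    """Stand-in for a fine-tuned model; it only echoes part of the prompt."""
    def reply(prompt: str) -> str:
        return f"[{tag}] responding to: {prompt[:30]}"
    return reply


speakers = {"Herzog": stub_speaker("H"), "Žižek": stub_speaker("Z")}
# The generator never ends on its own, so take only the first four turns here.
turns = list(islice(converse(speakers, "What is cinema?"), 4))
for name, text in turns:
    print(f"{name}: {text}")
```

Writing the loop as a generator keeps the conversation unbounded in principle while letting the caller decide how many turns to consume.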

At their very core, language models predict the next word in a sequence, given a series of words already present. By fine-tuning a language model, it is possible to replicate the style and concepts that a specific person is likely to speak about, provided that you have abundant conversation transcripts for that individual. I decided to use one of the leading commercial language models available. That’s when it dawned on me that it’s already possible to generate a fake dialogue, including its synthetic voice form, in less time than it takes to listen to it. This provided me with an obvious name for the project: Infinite Conversation. After a couple of months of work, I published it online last October. The Infinite Conversation will also be displayed, starting February 11, at the Misalignment Museum art installation in San Francisco.
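Next-word prediction can be illustrated with a very small sketch. The Python below counts which word follows which in a short invented corpus and always picks the most frequent follower. It is only a stand-in: commercial large language models are neural networks trained on vastly more text, and the corpus and function names here are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def build_bigram_model(text):
    """Count, for every word, how often each other word follows it."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(model, start, length):
    """Extend `start` one predicted word at a time, up to `length` words."""
    out = [start.lower()]
    for _ in range(length - 1):
        nxt = predict_next(model, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


# Invented toy corpus; a real fine-tuning set would be transcripts of one speaker.
corpus = "the jungle is chaos and the jungle is murder and the jungle is chaos"
model = build_bigram_model(corpus)
print(predict_next(model, "is"))   # prints "chaos"
print(generate(model, "the", 4))   # prints "the jungle is chaos"
```

Because the counts come entirely from the corpus, swapping in text from a different speaker changes what the model predicts, which is the intuition behind fine-tuning on one person's transcripts.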

Once all the pieces fell into place, I marveled at something that hadn’t occurred to me when I started the project. Like their real-life personas, my chatbot versions of Herzog and Žižek often converse about philosophy and aesthetics. Because of the esoteric nature of these topics, the listener can temporarily ignore the occasional nonsense that the model generates. For example, AI Žižek’s view of Alfred Hitchcock alternates between seeing the famous director as a genius and as a cynical manipulator; in another inconsistency, the real Herzog notoriously hates chickens, but his AI imitator sometimes speaks about the fowl compassionately. Because actual postmodern philosophy can read as muddled, a problem Žižek himself noted, the lack of clarity in the Infinite Conversation can be interpreted as profound ambiguity rather than impossible contradictions.

This probably contributed to the overall success of the project. Several hundred of the Infinite Conversation’s visitors have listened for over an hour, and in some cases people have tuned in for much longer. As I mention on the website, my hope for visitors of the Infinite Conversation is that they not dwell too seriously on what is being said by the chatbots, but gain awareness of this technology and its consequences; if this AI-generated chatter seems plausible, imagine the realistic-sounding speeches that could be used to tarnish the reputations of politicians, scam business leaders or simply distract people with misinformation that sounds like human-reported news.

But there is a bright side. Infinite Conversation visitors can join a growing number of listeners who report that they use the soothing voices of Werner Herzog and Slavoj Žižek as a form of white noise to fall asleep. That’s a usage of this new technology I can get into.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.

