
What would a ‘multi-modal’ GPT-4 mean for businesses?

2023-03-11 21:41:03

Rumours have surrounded the size, performance and abilities of GPT-4, the next-generation large language model from OpenAI, since the company launched GPT-3 in June 2020. This has only intensified since the unexpected success of ChatGPT, and the latest rumour, from Microsoft Germany, suggests the tool will be able to analyse and produce more than just text. This could allow users to turn an organisational chart into a text report, or create a mood board from a video.

OpenAI released GPT-3 in 2020 and published a refined version in GPT-3.5 last year. (Photo by Laylistique/Shutterstock)

Microsoft is a major partner for OpenAI, having invested billions in the start-up since 2019 and utilised its models in a range of products. Speaking at an event in Germany, Andreas Braun, CTO at Microsoft Germany, said GPT-4 was coming next week and “will have multimodal models that will offer completely different possibilities – for example, videos”.

It is also rumoured that the model will be of a similar size to, or smaller than, the 175-billion-parameter GPT-3, thanks to improved optimisation and efficiency efforts. If true, this would see OpenAI follow a trend set by Meta with its LLaMA model and AI21 Labs with Jurassic-2. A long-standing rumour that it will have more than 100 trillion parameters has been debunked by OpenAI co-founder Sam Altman.

If, as Braun suggests, the next generation of OpenAI’s flagship large language model is multimodal, it could prove to be a revolutionary technology, as it would be able to analyse and generate video, images and possibly audio as well. It could be used to produce multimedia output and take inputs from a range of different forms of media.

Multimodal models are nothing new. OpenAI’s own DALL-E is a form of multimodal AI, trained on both text and images to allow for text-to-image or image-to-image generation. CLIP is another OpenAI model, developed to associate visual concepts with language. It is trained contrastively, maximising the agreement between matching image-text pairs while pushing apart mismatched ones.

It can be used for image classification, object detection and image retrieval. CLIP can also be used for zero-shot learning, the ability to perform a task without any task-specific training examples. Microsoft itself has already been experimenting with multi-modal AI models, and earlier this month released details of Kosmos-1, a model which can draw on data from text and images.
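
To make the zero-shot idea concrete, the snippet below is a minimal sketch of CLIP-style zero-shot image classification using the Hugging Face transformers library; the checkpoint name, the local file org_chart.png and the candidate labels are illustrative assumptions, not details from the article.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP checkpoint (illustrative choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("org_chart.png")  # hypothetical local image
labels = ["an organisational chart", "a photo of a cat", "a line graph"]

# Encode the image and the candidate captions, then compare their embeddings.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity scores as probabilities

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```

The label with the highest score is the zero-shot prediction: no chart-specific training examples are needed.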

Multi-modal AI: multimedia input and output

Very little specific information has been revealed about GPT-4 beyond the fact it will likely outperform the hugely successful GPT-3 and its interim successor GPT-3.5, which is a fine-tuned version of the original model. The comments from Microsoft Germany suggest multi-modality, which could be anything from accepting image or video inputs, to being able to produce a movie.

James Poulter, CEO of voice AI company Vixen Labs, says the former is most likely. “If GPT-4 becomes multi-modal in this way it opens up a whole load of new use cases. For example, being able to summarise long-form audio and video like podcasts and documentaries, or being able to extract meaning and patterns from large databases of photos and provide answers about what they contain.”

Many of the big LLM providers are looking at ways to integrate their models with other tools such as knowledge graphs, generative AI models and multimodal outputs, but Poulter says “the speed in which OpenAI has scaled the adoption of ChatGPT and GPT-3.5 puts it way out in front in terms of enterprise and consumer trust.”

One of the most likely use cases for multimedia input is in speech recognition or automatic transcription of audio or video, predicts AI developer Michal Stanislawek. This would build on the recently released Whisper API, which can quickly transcribe speech into text, and on synthetic voice generation. “I hope that this also means being able to send images and possibly videos and continue conversation based on their contents,” he says.
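
As a rough illustration of the transcription use case Stanislawek describes, the sketch below calls the hosted Whisper model through OpenAI's pre-1.0 Python SDK; the API key placeholder and the file name podcast_episode.mp3 are assumptions for the example.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; assumed to be supplied by the reader

# Transcribe a local audio file with the hosted Whisper model ("whisper-1").
with open("podcast_episode.mp3", "rb") as audio_file:
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])  # plain-text transcription of the audio
```

The resulting text could then be handed to a language model for summarisation, the kind of chained audio-to-summary workflow the article anticipates for podcasts and documentaries.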

“Multi-modality will be a huge change in how people utilise AI and what new use cases it can support. Entire companies will be built based on it,” adds Stanislawek, giving the examples of synthetic commentators for sports games in multiple languages, real-time summaries of meetings and events, and analysis of graphs to extract more meaning.

Will GPT-4 be truly multi-modal?

Conversational AI expert Kane Simms agrees, adding that multi-modal input rather than output is the most likely, but that if it is output-based then “you’re in interesting territory,” suggesting it could be used to generate a video from an image and audio file or create a “mood board” from a video.

However, Mark L’Estrange, a senior lecturer in e-sports at Falmouth University’s Games Academy, told Tech Monitor it is unlikely to be multi-modal in the true sense of the word, as that requires much more development and compute power. “Multi-modal means that you can give it verbal cues, you can upload pictures, you can give it any input whatsoever and it understands it and in context produces anything you want,” he says, adding: “right now we have a very fractured framework.”

He said that will come, describing it as ‘universal-modal’, where you could, through a series of inputs and prompts, generate something like a game prototype that can then be worked up into a full game using human input and talent. “The human input is what’s required to make these unique games that have these unique visions and to choose the right outputs from the AI. So maybe a team that was 40 or 50 people before would now be 20 people.”

Even if it is only partially multi-modal, able to take a simple image input and generate a text report, this could be significant for enterprises. It would allow a manager to submit a graph of performance metrics across different software options and have the AI generate a full report, or a CEO to send an organisation chart and have the AI suggest optimisations and changes for best performance.
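
No public GPT-4 interface exists at the time of writing, but a request for that kind of image-to-report workflow might look something like the following sketch; the model name, the message format with an image entry and the example URL are all assumptions, not a documented API.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Hypothetical multimodal request: a performance graph plus a text instruction.
# The shape mirrors OpenAI's chat completions API; image inputs like this were
# not publicly documented when this article was written.
response = openai.ChatCompletion.create(
    model="gpt-4",  # assumed multimodal-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write a short report comparing the software options shown in this graph."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/performance-metrics.png"}},
            ],
        }
    ],
)

print(response["choices"][0]["message"]["content"])  # the generated report text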

Read more: OpenAI’s ChatGPT is giving the rest of the world AI FOMO

Topics in this article: AI, Microsoft, OpenAI
