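43. Tsinghua Shenzhen, Tencent AI Lab, Peng Cheng Laboratory: DreamDiffusion generates high-quality images from what the human brain is thinking [The first step toward Inception: making thoughts concrete]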

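This paper was jointly published on June 30, 2023 by Tsinghua Shenzhen International Graduate School, Tencent AI Lab, and Peng Cheng Laboratory, under Computer Science on arXiv. The proposed DreamDiffusion generates high-quality images directly from electroencephalogram (EEG) signals without first converting thoughts into text, and in comparisons with baseline models its images are the best in both completeness and readability. The model and this research direction can help give concrete form to fleeting ideas, support the development of art, and show promise as a psychological aid in treating conditions such as childhood autism and language disorders.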

Paper: [2306.16934] DreamDiffusion: Generating High-Quality Images from Brain EEG Signals (arxiv.org)

Code: GitHub - bbaaii/DreamDiffusion: Implementation of “DreamDiffusion: Generating High-Quality Images from Brain EEG Signals”


This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. DreamDiffusion leverages pretrained text-to-image models and employs temporal masked signal modeling to pre-train the EEG encoder for effective and robust EEG representations. Additionally, the method further leverages the CLIP image encoder to provide extra supervision to better align EEG, text, and image embeddings with limited EEG-image pairs. Overall, the proposed method overcomes the challenges of using EEG signals for image generation, such as noise, limited information, and individual differences, and achieves promising results. Quantitative and qualitative results demonstrate the effectiveness of the proposed method as a significant step towards portable and low-cost “thoughts-to-image”, with potential applications in neuroscience and computer vision.



Image generation [16, 22, 4] has made great strides in recent years, especially after breakthroughs in text-to-image generation [31, 12, 30, 34, 1]. Recent text-to-image generation not only dramatically improves the quality of generated images, but also lets people turn their ideas into exquisite paintings and artworks controlled by text. We are very curious whether we could control image creation directly from brain activities (such as electroencephalogram (EEG) recordings), without translating our thoughts into text before creation. This kind of “thoughts-to-images” has broad prospects and could broaden people’s imagination. For example, it can greatly improve the efficiency of artistic creation and help capture fleeting inspirations. It also has the potential to help us visualize our dreams at night (which inspires the name DreamDiffusion). Moreover, it may even aid in psychotherapy, with the potential to help children with autism and people with language disabilities.



However, existing fMRI-based approaches are still far from our goal of using brain signals to create conveniently and efficiently.

1) Since fMRI equipment is not portable and needs to be operated by professionals, it is difficult to capture fMRI signals.

2) The cost of fMRI acquisition is high. These two limitations greatly hinder the widespread use of this method in practical artistic generation. In contrast, EEG (electroencephalogram) is a non-invasive and low-cost method of recording electrical activity in the brain. Portable commercial products are now available for the convenient acquisition of EEG signals, showing great potential for future art generation.


In this work, we aim to leverage the powerful generative capabilities of pre-trained text-to-image models (i.e., Stable Diffusion [32]) to generate high-quality images directly from brain EEG signals. However, this is non-trivial and poses two challenges.

1) EEG signals are captured non-invasively and thus are inherently noisy. In addition, EEG data are limited and individual differences cannot be ignored. How can we obtain effective and robust semantic representations from EEG signals under so many constraints?

2) Thanks to the use of CLIP [28] and training on a large number of text-image pairs, the text and image spaces in Stable Diffusion are well aligned. However, the EEG signal has its own characteristics, and its space is quite different from that of text and image. How can we align the EEG, text, and image spaces with limited and noisy EEG-image pairs?



To address the first challenge, we propose to train EEG representations using large amounts of EEG data instead of only rare EEG-image pairs. Specifically, we adopt masked signal modeling to predict the missing tokens based on contextual cues. Different from MAE [18] and MinD-Vis [7], which treat inputs as two-dimensional images and mask the spatial information, we consider the temporal characteristics of EEG signals, and dig deep into the semantics behind temporal changes in people’s brains. We randomly mask a proportion of tokens and then reconstruct those masked ones in the time domain. In this way, the pre-trained encoder learns a deep understanding of EEG data across different people and various brain activities.
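The temporal masking idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the token length, the 75% mask ratio, zero-filling of masked tokens, and the choice to score only the masked tokens with mean squared error are all assumptions made for illustration, and the function names are hypothetical.

```python
import random

def split_into_time_tokens(signal, token_len):
    """Group consecutive time steps of a 1-D signal into fixed-length tokens."""
    usable = len(signal) - len(signal) % token_len  # drop a ragged tail, if any
    return [signal[i:i + token_len] for i in range(0, usable, token_len)]

def mask_time_tokens(tokens, mask_ratio, rng):
    """Zero out a random subset of tokens; return the masked copy and masked indices."""
    n_masked = int(round(len(tokens) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    masked_set = set(masked_idx)
    masked = [[0.0] * len(t) if i in masked_set else list(t)
              for i, t in enumerate(tokens)]
    return masked, masked_idx

def reconstruction_loss(pred_tokens, target_tokens, masked_idx):
    """Mean squared error over the masked tokens only (an assumed choice)."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred_tokens[i], target_tokens[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

rng = random.Random(0)                                  # fixed seed for repeatability
signal = [float(t % 7) for t in range(32)]              # toy single-channel recording
tokens = split_into_time_tokens(signal, token_len=4)    # 8 tokens of 4 time steps
masked, masked_idx = mask_time_tokens(tokens, mask_ratio=0.75, rng=rng)
```

In a real pipeline an encoder-decoder network would receive `masked` and be trained to minimize `reconstruction_loss` against `tokens`; here the helper only shows which positions contribute to the loss.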



As for the second challenge, previous methods [40, 7] usually fine-tune Stable Diffusion (SD) models directly on a small number of noisy data pairs. However, it is difficult to learn an accurate alignment between brain signals (e.g., EEG and fMRI) and the text space by fine-tuning SD end to end with only the final image reconstruction loss. We thus propose to employ additional CLIP [28] supervision to assist in the alignment of the EEG, text, and image spaces. Specifically, SD itself uses CLIP’s text encoder to generate text embeddings, which are quite different from the masked pre-trained EEG embeddings of the previous stage. We leverage CLIP’s image encoder to extract rich image embeddings that align well with CLIP text embeddings. Those CLIP image embeddings are then used to further optimize the EEG embedding representations. Therefore, the refined EEG feature embeddings can be well aligned with the CLIP image and text embeddings, and are more suitable for SD image generation, which in turn improves the quality of generated images.
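One way to picture this extra supervision is a loss that pulls a projected EEG embedding toward the CLIP image embedding of the paired image. The sketch below assumes a cosine distance and a plain linear projection; the text above only says the CLIP image embeddings are used to further optimize the EEG embeddings, so both choices and all function names are illustrative assumptions.

```python
import math

def project(eeg_embedding, weight):
    """Hypothetical linear projection mapping an EEG embedding to the CLIP embedding size."""
    return [sum(w * x for w, x in zip(row, eeg_embedding)) for row in weight]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def clip_alignment_loss(projected_eeg, clip_image_embedding):
    """1 - cosine similarity: 0 when the directions match, 2 when they are opposite."""
    return 1.0 - cosine_similarity(projected_eeg, clip_image_embedding)

# Toy values: a 3-dimensional EEG embedding projected into a 2-dimensional "CLIP" space.
eeg_embedding = [1.0, 2.0, 0.0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
projected = project(eeg_embedding, weight)              # [1.0, 2.0]

aligned_loss = clip_alignment_loss(projected, [2.0, 4.0])      # same direction
orthogonal_loss = clip_alignment_loss(projected, [2.0, -1.0])  # perpendicular
opposite_loss = clip_alignment_loss(projected, [-1.0, -2.0])   # opposite direction
```

During fine-tuning, a term like this would be added to the usual image reconstruction loss so that gradients also push the EEG encoder toward the CLIP space; how the two terms are weighted is not stated in the text and is left open here.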



Our contributions can be summarized as follows.

1) We propose DreamDiffusion, which leverages powerful pre-trained text-to-image diffusion models to generate realistic images from EEG signals only. It is a further step towards portable and low-cost “thoughts-to-images”.

2) Temporal masked signal modeling is employed to pre-train the EEG encoder for effective and robust EEG representations.

3) We further leverage the CLIP image encoder to provide extra supervision to better align the EEG, text, and image embeddings with limited EEG-image pairs.

4) Quantitative and qualitative results demonstrate the effectiveness of DreamDiffusion.


Pre-training models have become increasingly popular in the field of computer vision, with various self-supervised learning approaches focusing on different pretext tasks [13, 43, 26]. These methods often utilize pretext tasks such as contrastive learning [2, 17], which models image similarity and dissimilarity, or autoencoding [6], which recovers the original data from a masked portion. In particular, masked signal modeling (MSM) has been successful in learning useful context knowledge for downstream tasks by recovering the original data from a high mask ratio for visual signals [18, 44] and a low mask ratio for natural languages [10, 29]. Another recent approach, CLIP [28], builds a multi-modal embedding space by pre-training on 400 million text-image pairs collected from various sources on the Internet. The representations learned by CLIP are extremely powerful, enabling state-of-the-art zero-shot image classification on multiple datasets and providing a method to estimate the semantic similarity between text and images.


Diffusion models:

Our method comprises three main components:

1) masked signal pre-training for an effective and robust EEG encoder

2) fine-tuning with limited EEG-image pairs with pre-trained Stable Diffusion

3) aligning the EEG, text, and image spaces using CLIP encoders.

Firstly, we leverage masked signal modeling with lots of noisy EEG data to train an EEG encoder to extract contextual knowledge. The resulting EEG encoder is then employed to provide conditional features for Stable Diffusion via the cross-attention mechanism. In order to enhance the compatibility of EEG features with Stable Diffusion, we further align the EEG, text, and image embedding spaces by reducing the distance between EEG embeddings and CLIP image embeddings during fine-tuning. As a result, we can obtain DreamDiffusion, which is capable of generating high-quality images from EEG signals only.
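To make the cross-attention conditioning step more concrete, here is a minimal single-head scaled dot-product attention in plain Python, where image latent positions act as queries and EEG features act as keys and values. It is a sketch under simplifying assumptions: the learned query, key, and value projections of a real Stable Diffusion layer are omitted, the toy vectors are invented, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query attends over all conditioning tokens (keys/values)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity between this query and every conditioning token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy shapes: 2 latent positions (queries) attend over 3 EEG tokens (keys/values).
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
eeg_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eeg_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out, attn = cross_attention(latent_queries, eeg_keys, eeg_values)
```

Because the value vectors here are one-hot, each output row equals its attention weights, which makes it easy to see how strongly each latent position draws on each EEG token.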









