add Digress
Figures, text, and formulas each form a self-contained thread; any one of them on its own should already convey the method and experiments clearly.

[toc]

## Writing

1. introduction 1-3
3. method
   1. There should be a pipeline: a framework that runs from start to finish. What dimensions do the input and output have?
   2. Whatever appears in the figure should also show up in the formulas; the figure, the text, and the formulas should all correspond to one another. If one of them is removed, the others may no longer be clear on their own.
4. experiment
   1. Get it running end to end.
   2. Check whether it runs on various datasets.
   3. After producing the results, why were these hyperparameters chosen? What role does each hyperparameter play?

Graph DiT is used for multi-conditional molecule generation.

1. **Condition encoder** to learn the representation of numerical and categorical properties
2. Utilises a **Transformer-based graph denoiser** to achieve molecular graph denoising under conditions.
3. Propose a graph-dependent noise model for training Graph DiT; previously, noise was added to atoms and bonds separately during the forward diffusion process.

RQ3: ablation study, swap out AdaLN and the Cluster module and see the effect.
## Lift Your Molecules 2406.10513

## CoTracker: It is Better to Track Together 2307.07635

### code

The class CoTrackerPredictor in predictor.py calls build_cotracker() to create a CoTracker2(nn.Module) located in cotracker.py.

CoTrackerPredictor._compute_sparse_tracks() predicts the tracks and the visibilities.

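A minimal usage sketch against the public CoTracker repo (the checkpoint path, video shape, and grid_size here are illustrative, not values taken from these notes):

```python
import torch
from cotracker.predictor import CoTrackerPredictor

video = torch.randn(1, 16, 3, 384, 512)  # placeholder clip, shape B T C H W

model = CoTrackerPredictor(checkpoint="./checkpoints/cotracker2.pth")
# The forward call goes through _compute_sparse_tracks(); query points are placed on a regular grid.
pred_tracks, pred_visibility = model(video, grid_size=10)
print(pred_tracks.shape)      # (B, T, N, 2) pixel coordinates per point and frame
print(pred_visibility.shape)  # per-point, per-frame visibility
```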
### key research question

optical flow

The conv NN extracts d-dimensional features; the resolution is reduced by a factor of 4.

## DiffusionNAG 2305.16943

### key research question

Repetitive sampling and training of many task-irrelevant architectures.

> suffer from the high search cost
>
> proposed to utilize parameterized property predictors without training
>
> NAS wastes time exploring an extensive search space
>
> the property predictors mostly play a passive role, such as evaluators that rank architecture candidates provided by a search strategy, simply to filter them out during the search process.

### Innovation Point

#### Abstract

1. proposed a novel conditional **Neural Architecture Generation (NAG)** framework based on diffusion models
2. the guidance of parameterized predictors => task-optimal architectures => sampling from a region that is more likely to satisfy the properties.

#### Introduction

1. diffusion generative models.
2. **train** the base diffusion generative model **without requiring expensive label information**, e.g. accuracy.
3. deploy the trained diffusion model to diverse downstream tasks, while controlling the generation process with **property predictors**.
4. we leverage **gradients** of **parameterized predictors** to guide the generative model toward the space of architectures with desired properties.
5. our approach facilitates efficient search by **generating architectures that follow the specific distribution of interest within the search space**.
6. utilizes the predictor for both NAG and evaluation purposes.
7. we can swap out the predictors in a plug-and-play manner without retraining the base generative model.
8. design a score network for neural architectures.
9. Neural architectures have previously been represented as directed acyclic graphs to model the computation flow, while generic graph diffusion models use undirected graphs that capture only structural information and completely ignore the directional relationships between nodes. They introduce a score network that encodes the **positional information** of nodes to capture **their order as connected by directed edges**.

Summary:

First train a base generative diffusion model; then, during generation (denoising), bring in a property predictor. A separate predictor is trained for each task, so that high-quality architectures can be generated per task.

### Method

- Neural architecture $\mathbf{A}$ with $N$ nodes, defined by an **operator type matrix** $\mathcal{V}\in\mathbb{R}^{N\times F}$ and an upper-triangular adjacency matrix $\mathcal{E}$, i.e. $\mathbf{A}=(\mathcal{V},\mathcal{E})$, where $F$ is the number of predefined operators
- Search space $\mathcal{A}$
- Score network $s_\theta$
- Desired property $y$

##### Forward process

$$
d\mathbf{A}_t = \mathbf{f}_t(\mathbf{A}_t)\,\text{d}t+g_t\,\text{d}\mathbf{w}
$$

$\mathbf{f}_t$ is the linear drift coefficient, a map from $\mathcal{A}$ to $\mathcal{A}$.

$g_t$ is the diffusion coefficient, a scalar-valued function from $\mathcal{A}$ to $\mathbb{R}$.

##### Reverse process

$$
d\mathbf{A}_t = \left[\mathbf{f}_t(\mathbf{A}_t)-g_t^2\nabla_{\mathbf{A}_t}\log p_t(\mathbf{A}_t)\right]\text{d}\bar{t} + g_t\,\text{d}\bar{\mathbf{w}}
$$

After generating samples by simulating the reverse diffusion process, the entries of the architecture matrices are discretized with the operator $\mathbb{1}_{>0.5}$ to obtain discrete 0-1 matrices.

The score network $s_\theta$ is trained to approximate $\nabla_{\mathbf{A}_t}\log p_t(\mathbf{A}_t)$.

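For instance, the final thresholding step might look like this (the tensors here are placeholders):

```python
import torch

# placeholders for the continuous matrices produced by the last reverse step
V_T = torch.rand(8, 5)                  # N x F operator-type matrix
E_T = torch.triu(torch.rand(8, 8), 1)   # upper-triangular adjacency matrix

V_discrete = (V_T > 0.5).float()        # apply the indicator 1_{>0.5} entrywise
E_discrete = (E_T > 0.5).float()
```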
##### score network

Goals: 1. capture the dependency between nodes; 2. capture the accurate position of each layer.

$L$ Transformer blocks ($T$) are used, with an attention mask $M$ that encodes the dependency between nodes.

The score network is parameterized with the upper-triangular (adjacency) matrix.

$Emb_{pos}$ captures the topological ordering between layers.

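A rough sketch of one such block (names, dimensions, and the way the mask is built are assumptions for illustration, not the paper's implementation):

```python
import torch
import torch.nn as nn

class MaskedScoreBlock(nn.Module):
    """Transformer block whose attention mask encodes node dependencies (sketch)."""

    def __init__(self, dim, num_heads, num_nodes):
        super().__init__()
        self.pos_emb = nn.Embedding(num_nodes, dim)   # Emb_pos: position of each layer in the DAG order
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, h, attn_mask):
        # h: (B, N, dim) node embeddings; attn_mask: (N, N) bool, True = "may not attend",
        # derived from the (upper-triangular) adjacency / dependency structure.
        pos = self.pos_emb(torch.arange(h.shape[1], device=h.device))
        h = h + pos
        x = self.norm1(h)
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        h = h + a
        return h + self.ff(self.norm2(h))
```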
##### conditional neural architecture generation

Generate neural architectures from the conditional distribution $p_t(\mathbf{A}_t|y)$ by solving the corresponding reverse-time SDE, whose drift uses the conditional score $\nabla_{\mathbf{A}_t}\log p_t(\mathbf{A}_t|y)$.

Decompose

$$
\nabla_{\mathbf{A}_t}\log p_t(\mathbf{A}_t|y) = \nabla_{\mathbf{A}_t}\log p_t(\mathbf{A}_t) + \nabla_{\mathbf{A}_t}\log p_t(y|\mathbf{A}_t)
$$

The first term is approximated by the score network and the second by a predictor $f_\phi(y|\mathbf{A}_t)$ with parameters $\phi$, so the conditional score is approximated as

$$
\nabla_{\mathbf{A}_t}\log p_t(\mathbf{A}_t|y) \approx s_\theta(\mathbf{A}_t,t) + \nabla_{\mathbf{A}_t}\log f_\phi(y|\mathbf{A}_t)
$$

This way the score network only needs to be trained once; for a new task, only the predictor has to be trained.

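One guided reverse step could be sketched as follows (an illustrative Euler-Maruyama discretization; `score_net`, `predictor`, `f_t`, and `g_t` are assumed interfaces, not the paper's code):

```python
import math
import torch

def guided_reverse_step(A_t, t, dt, score_net, predictor, y, f_t, g_t):
    """One Euler-Maruyama step of the predictor-guided reverse SDE (sketch).

    A_t: noisy architecture tensor; dt < 0 because we integrate backwards in time.
    score_net(A_t, t) ~ grad log p_t(A_t); predictor(A_t, t, y) ~ log f_phi(y | A_t).
    """
    A_t = A_t.detach().requires_grad_(True)
    score = score_net(A_t, t)                             # unconditional score
    log_py = predictor(A_t, t, y)                         # log-likelihood of the target property
    guidance = torch.autograd.grad(log_py.sum(), A_t)[0]  # grad_A log f_phi(y | A_t)
    cond_score = score + guidance                         # decomposed conditional score
    drift = f_t(A_t, t) - g_t(t) ** 2 * cond_score
    diffusion = g_t(t) * math.sqrt(abs(dt)) * torch.randn_like(A_t)
    return (A_t + drift * dt + diffusion).detach()
```

Because the predictor only enters through its gradient at sampling time, it can be swapped per task while the score network stays fixed.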
##### Transferable

A dataset-level predictor $f_\phi(D,\mathbf{A}_t)$ is used; the predictor is meta-learned.

### Experiments

#### Abstract

1. Compared against Transferable NAS and Bayesian Optimization (BO)-based NAS: speedups of up to 35x.
2. When integrated into a BO-based algorithm, it outperforms the baselines.

#### Introduction

1. Transferable NAS and Bayesian Optimization (BO)-based NAS.
   1. Transferable NAS uses transferable dataset-aware predictors.
   2. DiffusionNAG demonstrates superior generation quality compared to MetaD2A.
   3. This is because DiffusionNAG overcomes a limitation of existing BO-based NAS, which **samples low-quality architectures during the initial phase**, by sampling from the space of architectures that satisfy the given properties.

#### experiments

The predictor is trained on the mbv3 (MobileNetV3) and nb201 (NAS-Bench-201) search spaces.

#### ways of presenting results

tables

##### bar plots

### Questionable Point

#### Introduction

1. various types of NAS tasks (e.g., latency- or robustness-constrained NAS)

## DiGress 2209.14734

### key research question

- discrete denoising diffusion model for generating graphs with **categorical** node and edge attributes

### Innovation point

#### abstract

- **discrete diffusion process**, progressively edits graphs with noise
- **graph transformer** = denoiser; turns distribution learning over graphs into a sequence of node and edge classification tasks.
- **Markovian noise model**, preserves the marginal distribution of node and edge types during diffusion
- procedure for conditioning the generation on graph-level features.

#### Introduction

- previously, **Gaussian noise** was added to node features and the `adj_matrix`; such continuous diffusion may destroy the graph's sparsity and create complete, noisy graphs
- DiGress: noise = graph edits (edge addition or deletion)
- a graph transformer denoiser predicts the clean graph from a noisy input; the resulting model admits an ELBO for likelihood estimation
- **guidance procedure** for conditioning graph generation on **graph-level properties**

### method

noise model $q$

data point $x$

a sequence of increasingly noisy data points $(z^1,...,z^T)$, where $q(z^1,...,z^T|x) = q(z^1|x)\prod_{t=2}^Tq(z^t|z^{t-1})$

denoising neural network $\phi_\theta$

#### Diffusion models

Noise is sampled from the prior distribution and then iteratively denoised by applying the denoising network.

The denoising network is not trained to predict $z^{t-1}$ directly.

When $\int q(z^{t-1}|z^t,x)\,dp_\theta(x)$ is tractable, $x$ can be used as the target of the denoising network, which removes an important source of label noise.

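A minimal sketch of the kind of Markovian categorical noise step DiGress relies on (the tensor layout, the choice of transition matrices, and the handling of edge symmetry are simplifications, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def forward_noise_step(X, E, Q_x, Q_e):
    """One step q(z^t | z^{t-1}) of a discrete graph noise model (sketch).

    X: (n, a) one-hot node types, E: (n, n, b) one-hot edge types,
    Q_x: (a, a) and Q_e: (b, b) are Markov transition matrices whose rows sum to 1.
    """
    prob_X = X.float() @ Q_x                    # (n, a) distribution over noised node types
    prob_E = E.float() @ Q_e                    # (n, n, b) distribution over noised edge types
    X_t = F.one_hot(prob_X.multinomial(1).squeeze(-1), num_classes=Q_x.shape[0])
    flat = prob_E.reshape(-1, Q_e.shape[0])
    E_t = F.one_hot(flat.multinomial(1).squeeze(-1), num_classes=Q_e.shape[0]).reshape(E.shape)
    # In practice E_t is symmetrized (undirected graphs) and the diagonal is masked out.
    return X_t, E_t
```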
### experiments

#### abstract

- 3x validity improvement on a planar graph dataset
- scales to the large GuacaMol dataset containing 1.3M drug-like molecules without the use of molecule-specific representations.

Diffusion models in general:

Forward noising:

$$
q(x^t \mid x^{t-1})=\mathcal{N}\!\left(x^t;\sqrt{1-\beta_t}\,x^{t-1},\,\beta_t I\right)
$$

Backward denoising, with learned reverse transitions:

$$
p_\theta(x^{t-1} \mid x^t)=\mathcal{N}\!\left(x^{t-1};\mu_\theta(x^t,t),\,\Sigma_\theta(x^t,t)\right)
$$

Training: learn a denoiser from the dataset.

Generation: sample from the prior (latent space), then denoise.

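A compact sketch of these two directions for the Gaussian case (the schedule, shapes, and the `denoiser` interface are illustrative):

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)  # illustrative noise schedule

def forward_step(x_prev, t):
    """Sample x^t ~ q(x^t | x^{t-1}) = N(sqrt(1 - beta_t) x^{t-1}, beta_t I)."""
    return torch.sqrt(1 - betas[t]) * x_prev + torch.sqrt(betas[t]) * torch.randn_like(x_prev)

@torch.no_grad()
def generate(denoiser, shape):
    """Sample from the prior, then iteratively denoise (generic DDPM-style loop)."""
    x = torch.randn(shape)                       # x^T ~ N(0, I)
    for t in reversed(range(len(betas))):
        mean = denoiser(x, t)                    # assumed to predict the mean of p_theta(x^{t-1} | x^t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # simplistic choice of reverse variance
    return x
```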
## Appendix

### Bilinear Sampling

$$
\dots = 25
$$

So the estimated pixel value at position $(2.5, 3.5)$ is 25.

In this way, bilinear sampling smoothly interpolates image values at non-integer positions, and it is widely used in image processing and computer vision tasks.

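A minimal NumPy sketch of bilinear sampling at a non-integer location (the 2x2 image values below are placeholders, not the data from the worked example above):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate img (H, W) at the continuous location (x, y) = (col, row)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
            + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])

img = np.array([[10.0, 20.0], [30.0, 40.0]])  # placeholder image
print(bilinear_sample(img, 0.5, 0.5))          # 25.0: equal weights on the four neighbours
```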
### Distributions

**Conditional distribution**

$P(A|B)=\frac{P(AB)}{P(B)}$

**Prior**

$P(\theta)$ reflects our subjective belief or knowledge about the parameter values before any data has been observed.

**Posterior**

$P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)}$

**Likelihood**

Given a model and observed data, the likelihood function measures how plausible a parameter value is in light of the data. Unlike a probability, the likelihood is a function of the parameters, not of the data.

$L(\theta|x)=P(x|\theta)$

**Marginal distribution**

$P(A) = \int P(A,B)\,dB = \int P(A|B)P(B)\,dB$

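A tiny worked example with made-up numbers, to see how prior, likelihood, marginal, and posterior fit together:

$$
P(\theta_1)=0.3,\quad P(\theta_2)=0.7,\quad P(x|\theta_1)=0.8,\quad P(x|\theta_2)=0.2
$$

$$
P(x)=0.3\cdot0.8+0.7\cdot0.2=0.38,\qquad
P(\theta_1|x)=\frac{0.8\cdot0.3}{0.38}\approx 0.63
$$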
### Closed-Form Expression

A **closed-form expression** is a mathematical expression that can be written and evaluated using a finite number of standard operations: basic arithmetic (addition, subtraction, multiplication, division), powers, roots, exponentials, logarithms, and known functions such as the trigonometric functions. In other words, a closed-form expression lets us compute the exact result in a finite number of steps, without numerical methods or iterative procedures.

If $q(z^t|x)$ has a closed-form expression, we can compute it directly, without iteration or numerical approximation.

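For example, for the Gaussian forward process above, the $t$-step marginal is available in closed form, so $x^t$ can be sampled in one shot instead of iterating $t$ forward steps:

$$
q(x^t \mid x^0)=\mathcal{N}\!\left(x^t;\ \sqrt{\bar\alpha_t}\,x^0,\ (1-\bar\alpha_t)I\right),\qquad \bar\alpha_t=\prod_{s=1}^{t}(1-\beta_s)
$$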
### ELBO (Evidence Lower Bound)

The Evidence Lower Bound (ELBO) is a central concept in variational inference, commonly used to approximate posterior distributions and to optimize probabilistic models. Its main purpose is to approximate a posterior that is hard to compute directly by optimizing a tractable lower bound instead.

#### Background

In Bayesian inference we want the posterior distribution $p(z|x)$, where $z$ is a latent variable and $x$ is the observed data. Computing the posterior directly is usually infeasible because it requires the marginal likelihood $p(x)$, which involves integrating (or summing) over all possible latent variables $z$:

$$
p(x) = \int p(x, z) \, dz
$$

#### Variational Inference

Variational inference introduces a simple distribution $q(z)$ to approximate the complex posterior $p(z|x)$ and optimizes it so that the two are as close as possible. Concretely, we minimize the discrepancy between $q(z)$ and $p(z|x)$, usually measured by the Kullback-Leibler (KL) divergence:

$$
\text{KL}(q(z) \,\|\, p(z|x))
$$

#### The Evidence Lower Bound (ELBO)

The ELBO is an optimization objective: maximizing the ELBO is equivalent to minimizing this KL divergence, so maximizing it indirectly brings $q(z)$ and $p(z|x)$ as close as possible. The derivation is as follows.
#### 1. Objective

We want to maximize the log marginal likelihood $\log p(x)$:

$$
\log p(x) = \log \int p(x, z) \, dz
$$

#### 2. Introducing the variational distribution $q(z)$

We introduce the variational distribution $q(z)$ and obtain a lower bound via Jensen's inequality:

$$
\log p(x) = \log \int q(z) \frac{p(x, z)}{q(z)} \, dz \geq \int q(z) \log \frac{p(x, z)}{q(z)} \, dz
$$

This lower bound is the ELBO:

$$
\text{ELBO} = \mathbb{E}_{q(z)} \left[ \log \frac{p(x, z)}{q(z)} \right]
$$

#### 3. Further decomposition

The ELBO can be decomposed into a reconstruction term and a KL term:

$$
\text{ELBO} = \mathbb{E}_{q(z)}[\log p(x|z)] - \text{KL}(q(z) \,\|\, p(z))
$$

where:
- $\mathbb{E}_{q(z)}[\log p(x|z)]$ is the reconstruction term: the expected log-likelihood of the observed data $x$ given the latent variable $z$.
- $\text{KL}(q(z) \,\|\, p(z))$ is the KL term: the discrepancy between $q(z)$ and the prior $p(z)$.

#### Intuition

- **Reconstruction term**: measures how well the model reconstructs the observed data $x$ given the latent variable $z$. We want this to be large, i.e. the model explains the data well.
- **KL term**: measures how far the approximate distribution $q(z)$ is from the prior $p(z)$. We want this to be small.

#### Application in Variational Autoencoders (VAE)

In a variational autoencoder, the ELBO is the training objective. A VAE learns a generative model $p(x|z)$ and an approximate posterior $q(z|x)$, and maximizes

$$
\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) \,\|\, p(z))
$$

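A minimal PyTorch sketch of this objective as a loss (assuming a Bernoulli decoder and a diagonal-Gaussian encoder; the interfaces are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO for a VAE with q(z|x) = N(mu, diag(exp(logvar))) and prior p(z) = N(0, I)."""
    # Reconstruction term: E_q[log p(x|z)], here a Bernoulli decoder -> binary cross-entropy
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(q(z|x) || p(z)) in closed form for two diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```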
#### Summary

The Evidence Lower Bound (ELBO) is a core concept in variational inference: maximizing it is an effective way to approximate the posterior distribution. The ELBO consists of a reconstruction term and a KL term, which respectively measure the model's ability to reconstruct the data and the gap between the approximate distribution and the prior. In practice, for example in variational autoencoders, the ELBO is used as the optimization objective and helps the model learn better latent representations and generative ability.
### KL Divergence

KL divergence (Kullback-Leibler divergence), also called relative entropy, is an asymmetric measure from information theory and probability theory of the difference between two probability distributions. It measures the information lost when a distribution $Q$ is used to approximate a distribution $P$.

#### Definition

Given two probability distributions $P$ and $Q$ defined on the same sample space $\mathcal{X}$, the KL divergence $D_{KL}(P \| Q)$ is defined as

$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$

For continuous probability distributions, it is defined as

$D_{KL}(P \| Q) = \int_{-\infty}^{\infty} P(x) \log \frac{P(x)}{Q(x)} \, dx$

#### Intuition

The KL divergence can be understood as the extra amount of information needed to encode data drawn from $P$ when using a code optimized for $Q$. In other words, if we use $Q$ as an approximate distribution to encode data actually generated by $P$, the KL divergence measures the information loss caused by this approximation.
#### Properties

1. **Non-negativity**:
   $D_{KL}(P \| Q) \geq 0$
   The KL divergence is always non-negative, and it equals 0 exactly when $P$ and $Q$ are identical; this follows from the properties of the logarithm and Jensen's inequality.

2. **Asymmetry**:
   $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$
   The KL divergence is generally not symmetric: the information loss from approximating $P$ with $Q$ is not the same as that from approximating $Q$ with $P$.

3. **Zero only at equality**:
   The KL divergence is zero if and only if $P$ and $Q$ agree at every point.

#### Applications

KL divergence is widely used in information theory, statistics, and machine learning. Some typical applications:

#### 1. Variational inference

In variational inference, the KL divergence measures the discrepancy between the approximate posterior $q(z|x)$ and the true posterior $p(z|x)$. The optimization objective is usually to minimize this discrepancy:

$\text{KL}(q(z|x) \,\|\, p(z|x))$

By maximizing the ELBO (evidence lower bound), variational inference indirectly minimizes this KL divergence.
#### Example

Suppose we have two discrete probability distributions $P$ and $Q$:

$$
P = \{0.4, 0.6\}, \qquad Q = \{0.5, 0.5\}
$$

Then the KL divergence $D_{KL}(P \| Q)$ is computed as

$$
\begin{aligned}
D_{KL}(P \| Q) &= 0.4 \log \frac{0.4}{0.5} + 0.6 \log \frac{0.6}{0.5} \\
&= 0.4 \log 0.8 + 0.6 \log 1.2 \\
&= 0.4 \cdot (-0.2231) + 0.6 \cdot 0.1823 \\
&= -0.08924 + 0.10938 \\
&= 0.02014
\end{aligned}
$$

This result (in nats, since the natural logarithm is used) is the information lost when $Q$ is used to approximate $P$.

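The same number can be checked directly (a two-line sketch):

```python
import math

P, Q = [0.4, 0.6], [0.5, 0.5]
print(sum(p * math.log(p / q) for p, q in zip(P, Q)))  # ~0.0201 nats, matching the computation above
```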
#### Summary

KL divergence is an important tool for measuring the difference between two probability distributions and has broad applications. It quantifies the information gap between distributions and plays a key role in variational inference, information theory, and machine learning.
### words

gas permeability 气体渗透率

conformation 构象

Auxiliary 辅助性的