Large Language Model
Generative Model
- Autoregressive (AR) model: generates output one token at a time, conditioned on previously generated tokens.
- Non-autoregressive (NAR) model: generates all output tokens at once, in parallel, without conditioning on previously generated tokens.
| | AR Model | NAR Model |
|---|---|---|
| Parallelism | Low | High |
| Speed | Slow | Fast |
| Quality | High | Low |
Combining the two approaches (encoder + decoder architecture):
- Use an AR model to generate an intermediate representation, then a NAR model to generate the final output.
- Run the NAR model several times, iteratively refining the output.
- Speculative decoding: use a NAR (or much smaller) model to quickly draft several predicted tokens and feed them to the AR model as candidate continuations, so the AR model can emit multiple tokens per step (see the sketch below).
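A minimal sketch of the speculative-decoding idea above, assuming hypothetical `drafter` and `target` callables (this is the greedy-acceptance variant; production systems use probabilistic acceptance):

```python
# Sketch: drafter proposes k tokens cheaply; target verifies them in one pass.
def speculative_decode(target, drafter, tokens, k=4, steps=16):
    for _ in range(steps):
        # 1. Drafter proposes k tokens autoregressively (cheap model).
        draft = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2. Target scores the whole draft in ONE forward pass and returns its
        #    own greedy choice at each of the k draft positions plus one extra.
        verified = target(tokens, draft)  # list of k + 1 target tokens
        # 3. Accept the longest prefix where drafter and target agree, then
        #    append the target's first disagreeing (or bonus) token.
        n = 0
        while n < k and draft[n] == verified[n]:
            n += 1
        tokens = tokens + draft[:n] + [verified[n]]
    return tokens
```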

ChatGPT
Fine-tuned GPT model on conversational data:
- Pre-training: learn next-token prediction ("text continuation") from large-scale data via self-supervised learning.
- Instruction-tuning (IT): human-guided "text continuation"; humans annotate answers to a subset of questions (supervised learning) to steer the direction of the model's generations.
- Reinforcement learning from human feedback (RLHF): train a reward model that scores the model's generated answers and stands in for human feedback; using the reward model's score as the reward, optimize the model with reinforcement learning. RLHF usually focuses on three aspects: helpfulness, honesty, and harmlessness.
In short: instruction tuning (IT) with supervised learning on labelled data, followed by reinforcement learning from human feedback (RLHF).
Diffusion
Forward process (diffusion, progressively add noise) + reverse process (denoising).
Stable Diffusion model.
Video
Generative video models as world simulators.
Embeddings
- Continuous bag of words (CBOW): predict the middle word from the embeddings of its surrounding words. Fast to train and accurate for frequent words.
- Skip-gram: predict the surrounding words within a certain range from the middle word. Slower to train but accurate for rare words (see the sketch after this list).
- GloVe/SWIVEL: capture both global and local information about words via a co-occurrence matrix.
- Shallow bag-of-words (BoW) embeddings.
- Deeper pre-trained embeddings.
- Multi-modal embeddings: image.
- Structured-data embeddings.
- Graph embeddings.
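A minimal sketch of how CBOW and skip-gram differ in the training pairs they build from a sentence (illustrative only; a real word2vec implementation adds negative sampling or hierarchical softmax):

```python
# CBOW: (context words) -> middle word; skip-gram: middle word -> each context word.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
print(cbow_pairs(sentence)[0])       # (['quick', 'brown'], 'the')
print(skipgram_pairs(sentence)[:2])  # [('the', 'quick'), ('the', 'brown')]
```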


Two-tower (dual-encoder) models map two sets of data (a user dataset and an item/product/etc. dataset) into the same embedding space.
Embeddings + approximate nearest neighbor (ANN) vector stores and index structures (ScaNN, FAISS, LSH, KD-tree, ball tree) enable:
- Retrieval augmented generation (RAG).
- Semantic text similarity.
- Few-shot classification.
- Clustering.
- Recommendation systems.
- Anomaly detection.
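A minimal sketch of embedding retrieval with FAISS, one of the libraries listed above (the random vectors stand in for real embeddings; `IndexFlatL2` is exact search, so swap in an IVF/HNSW index for true ANN):

```python
import numpy as np
import faiss

d = 128                                   # embedding dimension
corpus = np.random.rand(10_000, d).astype("float32")  # stand-in corpus embeddings
query = np.random.rand(1, d).astype("float32")        # stand-in query embedding

index = faiss.IndexFlatL2(d)              # exact L2 index; IndexIVFFlat/IndexHNSWFlat for true ANN
index.add(corpus)                         # store the corpus vectors
distances, ids = index.search(query, 5)   # top-5 nearest neighbors
print(ids[0], distances[0])
```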

Group Relative Policy Optimization
- GRPO tricks: sample a group of responses per prompt and use the group's mean/std reward as the baseline for each response's advantage, instead of a learned value network (see the sketch below).
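A minimal numpy sketch of the group-relative advantage described above (the surrounding PPO-style clipped objective and KL penalty are omitted):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    # Normalize each sampled response's reward against its group's statistics.
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# e.g. reward-model scores for 4 sampled answers to the same prompt
print(group_relative_advantages([0.1, 0.7, 0.4, 0.9]))
```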
Inference Acceleration
- Quantization: reduce the precision of model weights and activations (see the sketch after this list).
- Distillation: data distillation, knowledge distillation, on-policy distillation.
- Flash attention: minimize data movement between slow HBM and the faster on-chip memory tier (SRAM/VMEM).
- Prefix (KV) caching: avoid recomputing attention over the existing input on each autoregressive decode step.
- Speculative decoding: generate multiple candidate tokens with a much smaller drafter model, then verify them with the target model.
- Batching and parallelization: sequence, pipeline, and tensor parallelism.
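A minimal sketch of symmetric per-tensor int8 weight quantization, the simplest form of the quantization item above (real schemes use per-channel or per-group scales and calibrated activation ranges):

```python
import numpy as np

def quantize_int8(w):
    # Map float32 weights to int8 with a single per-tensor scale.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # worst-case quantization error
```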
Reasoning
Test-time compute (inference-time compute): prompting models to generate intermediate reasoning steps dramatically improves performance on hard problems (see the sketch below).
Thinking tokens are the model's only persistent memory during reasoning.
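A minimal sketch of spending test-time compute via chain-of-thought prompting; `llm` is a hypothetical completion function and the worked example is illustrative:

```python
# Direct prompt: the model must answer immediately, with no thinking tokens.
direct_prompt = "Q: A farmer has 17 sheep and buys 2 dozen more. How many sheep?\nA:"

# Chain-of-thought prompt: a worked example plus "Let's think step by step"
# makes the model spend extra tokens on intermediate reasoning before answering.
cot_prompt = (
    "Q: A farmer has 17 sheep and buys 2 dozen more. How many sheep?\n"
    "A: Let's think step by step. 2 dozen = 24. 17 + 24 = 41. The answer is 41.\n\n"
    "Q: A shop sells 3 boxes of 12 eggs plus 5 loose eggs. How many eggs?\n"
    "A: Let's think step by step."
)

# answer = llm(cot_prompt)  # hypothetical call; reasoning tokens come before the answer
```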
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is a powerful design pattern for chatbots: a retrieval system fetches verified, query-relevant sources/documents in real time and feeds them to a generative model (e.g. GPT-4) to produce the response (see the sketch after this list):
- Effect: reduces hallucination.
- Cost: avoids retraining.
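A minimal sketch of the retrieve-then-generate flow, assuming hypothetical `embed`, `vector_store`, and `llm` components:

```python
def rag_answer(query, embed, vector_store, llm, k=3):
    query_vec = embed(query)                    # 1. embed the user query
    docs = vector_store.search(query_vec, k=k)  # 2. retrieve verified sources
    context = "\n\n".join(d.text for d in docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)                          # 3. grounded generation
```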

Context
Context is everything when it comes to getting the most out of an AI tool. To improve the relevance and quality of a generative AI output, you need to improve the relevance and quality of the input.
Agentic
Agentic RAG (autonomous retrieval agents) actively refines its search through iterative reasoning (see the sketch after this list):
- Context-aware query expansion.
- Multi-step reasoning.
- Adaptive source selection.
- Validation and correction.
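A minimal sketch of such an agentic retrieval loop; `llm`, `expand_query`, `retrieve`, and `is_sufficient` are hypothetical components standing in for the capabilities listed above:

```python
def agentic_rag(question, llm, expand_query, retrieve, is_sufficient, max_steps=3):
    query, evidence = question, []
    for _ in range(max_steps):
        query = expand_query(question, evidence)  # context-aware query expansion
        evidence += retrieve(query)               # adaptive source selection
        if is_sufficient(question, evidence):     # validation / correction check
            break                                 # enough evidence gathered
    return llm(question, evidence)                # final grounded answer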
Scaling Law
Existing pre-trained language models need far more data than scaling laws (e.g. Chinchilla) estimate. Many smaller models can also obtain large performance gains by training on extremely large pre-training corpora, largely because the Transformer architecture scales well with data. So far, no experiment has convincingly established the saturation data scale for a language model of a given parameter size (i.e. the point at which more data no longer improves performance).
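For reference, a commonly cited Chinchilla-style form of the scaling law (the constants are fitted empirically; the ~20 tokens-per-parameter figure is the usual rule-of-thumb reading, stated here as an approximation):

```latex
% Parametric loss fit in terms of parameters N and training tokens D:
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Compute-optimal allocation under budget C \approx 6ND scales both roughly equally:
N_{\mathrm{opt}} \propto C^{0.5}, \qquad D_{\mathrm{opt}} \propto C^{0.5}
\quad (\approx 20 \text{ training tokens per parameter})
```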
Emergent Ability
The emergent abilities of large language models are informally defined as abilities that are absent in small models but appear in large models:
- In-context learning.
- Instruction following.
- Step-by-step reasoning.
Recursive Language Models
RLMs use divide-and-conquer and recursion, implemented as multi-hop reasoning code, to address the context rot caused by long inputs (see the sketch below).
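A heavily simplified sketch of the divide-and-conquer idea: split a long context, answer each chunk, and recurse on the merged partial answers until everything fits the window. `llm` and `token_len` are hypothetical helpers:

```python
def recursive_answer(question, context, llm, token_len, max_tokens=4000):
    # Base case: the context fits the window, answer directly.
    if token_len(context) <= max_tokens:
        return llm(f"Context:\n{context}\n\nQuestion: {question}")
    # Recursive case: split the context and solve each half first.
    mid = len(context) // 2
    left = recursive_answer(question, context[:mid], llm, token_len, max_tokens)
    right = recursive_answer(question, context[mid:], llm, token_len, max_tokens)
    # Merge the partial answers and recurse over the (much shorter) merged text.
    merged = f"Partial answer 1:\n{left}\n\nPartial answer 2:\n{right}"
    return recursive_answer(question, merged, llm, token_len, max_tokens)
```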
Library
Embeddings and Vector Stores
- OpenCLIP: Encodes images and text into numerical vectors.
- Chroma: AI-native embedding database.
- FAISS: Similarity search and dense vector clustering library.
Text-to-Speech
- MiniMax: MiniMax voice agent.
- ChatTTS: Generative speech model for daily dialogue.
- ChatterBox: SoTA open-source TTS model.
LLMs
- Ollama: Get up and running with large language models locally.
- Ollama UI: User-friendly web UI for LLMs.
- Transformers.js: Run HuggingFace Transformers directly in the browser.
- Jan: Offline local ChatGPT.
- LocalLLMs: Curated list of locally running LLMs.
References
- A survey of large language models.
- Efficient LLM architectures survey.
- Foundational LLM whitepaper.

