mirror of
https://github.com/xai-org/x-algorithm.git
synced 2026-02-13 03:05:06 +01:00
# X For You Feed Algorithm

This repository contains the core recommendation system powering the "For You" feed on X. It combines in-network content (from accounts you follow) with out-of-network content (discovered through ML-based retrieval) and ranks everything using a Grok-based transformer model.

> **Note:** The transformer implementation is ported from the [Grok-1 open source release](https://github.com/xai-org/grok-1) by xAI, adapted for recommendation system use cases.

## Table of Contents

- [Overview](#overview)
- [System Architecture](#system-architecture)
- [Components](#components)
  - [Home Mixer](#home-mixer)
  - [Thunder](#thunder)
  - [Phoenix](#phoenix)
  - [Candidate Pipeline](#candidate-pipeline)
- [How It Works](#how-it-works)
  - [Pipeline Stages](#pipeline-stages)
  - [Scoring and Ranking](#scoring-and-ranking)
  - [Filtering](#filtering)
- [Key Design Decisions](#key-design-decisions)
- [License](#license)

---

## Overview

The For You feed algorithm retrieves, ranks, and filters posts from two sources:

1. **In-Network (Thunder)**: Posts from accounts you follow
2. **Out-of-Network (Phoenix Retrieval)**: Posts discovered from a global corpus

Both sources are combined and ranked together using **Phoenix**, a Grok-based transformer model that predicts engagement probabilities for each post. The final score is a weighted combination of these predicted engagements.

We have eliminated every single hand-engineered feature and most heuristics from the system. The Grok-based transformer does all the heavy lifting by understanding your engagement history (what you liked, replied to, shared, etc.) and using that to determine what content is relevant to you.

---

## System Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    FOR YOU FEED REQUEST                                     │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
                                               │
                                               ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                         HOME MIXER                                          │
│                                    (Orchestration Layer)                                    │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                   QUERY HYDRATION                                   │   │
│   │  ┌──────────────────────────┐    ┌──────────────────────────────────────────────┐   │   │
│   │  │  User Action Sequence    │    │  User Features                               │   │   │
│   │  │  (engagement history)    │    │  (following list, preferences, etc.)         │   │   │
│   │  └──────────────────────────┘    └──────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                  CANDIDATE SOURCES                                  │   │
│   │        ┌─────────────────────────────┐    ┌────────────────────────────────┐        │   │
│   │        │           THUNDER           │    │       PHOENIX RETRIEVAL        │        │   │
│   │        │     (In-Network Posts)      │    │     (Out-of-Network Posts)     │        │   │
│   │        │                             │    │                                │        │   │
│   │        │     Posts from accounts     │    │   ML-based similarity search   │        │   │
│   │        │         you follow          │    │      across global corpus      │        │   │
│   │        └─────────────────────────────┘    └────────────────────────────────┘        │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                      HYDRATION                                      │   │
│   │    Fetch additional data: core post metadata, author info, media entities, etc.     │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                      FILTERING                                      │   │
│   │  Remove: duplicates, old posts, self-posts, blocked authors, muted keywords, etc.   │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                       SCORING                                       │   │
│   │  ┌──────────────────────────┐                                                       │   │
│   │  │  Phoenix Scorer          │   Grok-based Transformer predicts:                    │   │
│   │  │  (ML Predictions)        │   P(like), P(reply), P(repost), P(click)...           │   │
│   │  └──────────────────────────┘                                                       │   │
│   │               │                                                                     │   │
│   │               ▼                                                                     │   │
│   │  ┌──────────────────────────┐                                                       │   │
│   │  │  Weighted Scorer         │   Weighted Score = Σ (weight × P(action))             │   │
│   │  │  (Combine predictions)   │                                                       │   │
│   │  └──────────────────────────┘                                                       │   │
│   │               │                                                                     │   │
│   │               ▼                                                                     │   │
│   │  ┌──────────────────────────┐                                                       │   │
│   │  │  Author Diversity        │   Attenuate repeated author scores                    │   │
│   │  │  Scorer                  │   to ensure feed diversity                            │   │
│   │  └──────────────────────────┘                                                       │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                      SELECTION                                      │   │
│   │                    Sort by final score, select top K candidates                     │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                             FILTERING (Post-Selection)                              │   │
│   │                Visibility filtering (deleted/spam/violence/gore etc)                │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
                                               │
                                               ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    RANKED FEED RESPONSE                                     │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
```

---

## Components

### Home Mixer

**Location:** [`home-mixer/`](home-mixer/)

The orchestration layer that assembles the For You feed. It leverages the `CandidatePipeline` framework with the following stages:

| Stage | Description |
|-------|-------------|
| Query Hydrators | Fetch user context (engagement history, following list) |
| Sources | Retrieve candidates from Thunder and Phoenix |
| Hydrators | Enrich candidates with additional data |
| Filters | Remove ineligible candidates |
| Scorers | Predict engagement and compute final scores |
| Selector | Sort by score and select top K |
| Post-Selection Filters | Final visibility and dedup checks |
| Side Effects | Cache request info for future use |

The server exposes a gRPC endpoint (`ScoredPostsService`) that returns ranked posts for a given user.

---

### Thunder

**Location:** [`thunder/`](thunder/)

An in-memory post store and realtime ingestion pipeline that tracks recent posts from all users. It:

- Consumes post create/delete events from Kafka
- Maintains per-user stores for original posts, replies/reposts, and video posts
- Serves "in-network" post candidates from accounts the requesting user follows
- Automatically trims posts older than the retention period

Thunder enables sub-millisecond lookups for in-network content without hitting an external database.

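
The behaviour described above can be illustrated with a minimal Python sketch. This is not Thunder's implementation: the class name, method names, and the 48-hour retention value are hypothetical, the Kafka consumer is replaced by direct method calls, and the sketch assumes create events for an author arrive in timestamp order.

```python
from __future__ import annotations

from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: int
    author_id: int
    created_at: float  # seconds since epoch


class InMemoryPostStore:
    """Toy per-author post store (hypothetical names, not Thunder's API)."""

    def __init__(self, retention_seconds: float = 48 * 3600):
        # Retention value is an assumption; the README does not state it.
        self.retention_seconds = retention_seconds
        self._by_author: dict[int, deque[Post]] = defaultdict(deque)

    def on_create(self, post: Post) -> None:
        # Assumes events arrive oldest-first per author.
        self._by_author[post.author_id].append(post)

    def on_delete(self, post_id: int, author_id: int) -> None:
        posts = self._by_author.get(author_id)
        if posts is not None:
            self._by_author[author_id] = deque(
                p for p in posts if p.post_id != post_id
            )

    def trim(self, now: float) -> None:
        # Drop posts older than the retention window from the front.
        cutoff = now - self.retention_seconds
        for posts in self._by_author.values():
            while posts and posts[0].created_at < cutoff:
                posts.popleft()

    def in_network_candidates(self, following: list[int], now: float) -> list[Post]:
        # Return recent posts from followed authors, newest first.
        self.trim(now)
        found = [p for a in following for p in self._by_author.get(a, ())]
        return sorted(found, key=lambda p: p.created_at, reverse=True)
```
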

---

### Phoenix

**Location:** [`phoenix/`](phoenix/)

The ML component with two main functions:

#### 1. Retrieval (Two-Tower Model)
Finds relevant out-of-network posts:
- **User Tower**: Encodes user features and engagement history into an embedding
- **Candidate Tower**: Encodes all posts into embeddings
- **Similarity Search**: Retrieves top-K posts via dot product similarity

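
As a rough illustration of the similarity-search step, the sketch below ranks posts by dot product against a user embedding and keeps the top K. The two towers are not modelled here; the embeddings are assumed to be precomputed, and the function names are hypothetical. It uses an exhaustive scan for clarity, which is an assumption and not a statement about how Phoenix searches its corpus.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(user_embedding, post_embeddings, k):
    """Return the k post IDs whose embeddings score highest against the user.

    post_embeddings maps post ID -> embedding (assumed precomputed by the
    candidate tower). Exhaustive scan; illustrative only.
    """
    scored = (
        (dot(user_embedding, emb), post_id)
        for post_id, emb in post_embeddings.items()
    )
    return [post_id for _, post_id in heapq.nlargest(k, scored)]


user = [1.0, 0.0]
posts = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.5, 0.5]}
top = retrieve_top_k(user, posts, k=2)  # "a" then "c"
```
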

#### 2. Ranking (Transformer with Candidate Isolation)
Predicts engagement probabilities for each candidate:
- Takes user context (engagement history) and candidate posts as input
- Uses special attention masking so candidates cannot attend to each other
- Outputs probabilities for each action type (like, reply, repost, click, etc.)

See [`phoenix/README.md`](phoenix/README.md) for detailed architecture documentation.

---

### Candidate Pipeline

**Location:** [`candidate-pipeline/`](candidate-pipeline/)

A reusable framework for building recommendation pipelines. Defines traits for:

| Trait | Purpose |
|-------|---------|
| `Source` | Fetch candidates from a data source |
| `Hydrator` | Enrich candidates with additional features |
| `Filter` | Remove candidates that shouldn't be shown |
| `Scorer` | Compute scores for ranking |
| `Selector` | Sort and select top candidates |
| `SideEffect` | Run async side effects (caching, logging) |

The framework runs sources and hydrators in parallel where possible, with configurable error handling and logging.

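
To make the stage order concrete, here is a simplified Python sketch of how such stages might compose. It is not the `candidate-pipeline` crate's API: the function signature and the dict-based candidate type are hypothetical, stages run sequentially instead of in parallel, and error handling and side effects are omitted.

```python
def run_pipeline(query, sources, hydrators, filters, scorers, top_k):
    """Run source -> hydrate -> filter -> score -> select over dict candidates.

    Each stage is a plain callable (hypothetical shape):
      source(query) -> list of candidates
      hydrator(query, candidate) -> enriched candidate
      filter(query, candidate) -> True to keep
      scorer(query, candidate) -> float score
    """
    candidates = [c for source in sources for c in source(query)]
    for hydrate in hydrators:
        candidates = [hydrate(query, c) for c in candidates]
    for keep in filters:
        candidates = [c for c in candidates if keep(query, c)]
    for score in scorers:
        # Scorers run in order; a later scorer can read the earlier score.
        for c in candidates:
            c["score"] = score(query, c)
    candidates.sort(key=lambda c: c.get("score", 0.0), reverse=True)
    return candidates[:top_k]


query = {"user_id": 1}
in_network = lambda q: [{"id": 1, "author": 1}, {"id": 2, "author": 7}]
out_of_network = lambda q: [{"id": 3, "author": 8}]
add_length = lambda q, c: {**c, "length": c["id"] * 10}
not_own_post = lambda q, c: c["author"] != q["user_id"]
by_length = lambda q, c: float(c["length"])

feed = run_pipeline(
    query, [in_network, out_of_network], [add_length], [not_own_post], [by_length], top_k=1
)
```
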

---

## How It Works

### Pipeline Stages

1. **Query Hydration**: Fetch the user's recent engagement history and metadata (e.g., following list)

2. **Candidate Sourcing**: Retrieve candidates from:
   - **Thunder**: Recent posts from followed accounts (in-network)
   - **Phoenix Retrieval**: ML-discovered posts from the global corpus (out-of-network)

3. **Candidate Hydration**: Enrich candidates with:
   - Core post data (text, media, etc.)
   - Author information (username, verification status)
   - Video duration (for video posts)
   - Subscription status

4. **Pre-Scoring Filters**: Remove posts that are:
   - Duplicates
   - Too old
   - From the viewer themselves
   - From blocked/muted accounts
   - Containing muted keywords
   - Previously seen or recently served
   - Ineligible subscription content

5. **Scoring**: Apply multiple scorers sequentially:
   - **Phoenix Scorer**: Get ML predictions from the Phoenix transformer model
   - **Weighted Scorer**: Combine predictions into a final relevance score
   - **Author Diversity Scorer**: Attenuate repeated author scores for diversity
   - **OON Scorer**: Adjust scores for out-of-network content

6. **Selection**: Sort by score and select the top K candidates

7. **Post-Selection Processing**: Final validation of post candidates to be served

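
The Author Diversity Scorer in step 5 is described only as attenuating repeated author scores. One way this could work is sketched below, under assumptions that are ours and not the repository's: a geometric decay of 0.5 per repeated appearance, applied in descending score order, with non-negative scores.

```python
from collections import defaultdict


def attenuate_repeated_authors(candidates, decay=0.5):
    """Down-weight each author's 2nd, 3rd, ... post by decay, decay**2, ...

    candidates: list of (post_id, author_id, score) tuples with scores >= 0.
    The decay value and the geometric form are illustrative assumptions.
    """
    seen = defaultdict(int)  # author_id -> posts already counted
    rescored = []
    for post_id, author_id, score in sorted(
        candidates, key=lambda c: c[2], reverse=True
    ):
        rescored.append((post_id, author_id, score * decay ** seen[author_id]))
        seen[author_id] += 1
    return sorted(rescored, key=lambda c: c[2], reverse=True)


ranked = attenuate_repeated_authors(
    [("p1", "alice", 1.0), ("p2", "alice", 0.9), ("p3", "bob", 0.6)]
)
# alice's second post drops to 0.45, so bob's post now ranks above it
```
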

---

### Scoring and Ranking

The Phoenix Grok-based transformer model predicts probabilities for multiple engagement types:

```
Predictions:
├── P(favorite)
├── P(reply)
├── P(repost)
├── P(quote)
├── P(click)
├── P(profile_click)
├── P(video_view)
├── P(photo_expand)
├── P(share)
├── P(dwell)
├── P(follow_author)
├── P(not_interested)
├── P(block_author)
├── P(mute_author)
└── P(report)
```

The **Weighted Scorer** combines these into a final score:

```
Final Score = Σ (weight_i × P(action_i))
```

Positive actions (like, repost, share) have positive weights. Negative actions (block, mute, report) have negative weights, pushing down content the user would likely dislike.

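
A small worked example of the formula above, in Python. The weight values are placeholders invented for illustration (this README does not list the real weights); only their signs follow the description above.

```python
# Placeholder weights: positive for desirable actions, negative for
# actions that signal dislike. The magnitudes are NOT from the repository.
EXAMPLE_WEIGHTS = {
    "favorite": 1.0,
    "reply": 2.0,
    "repost": 1.5,
    "share": 1.5,
    "not_interested": -2.0,
    "mute_author": -5.0,
    "block_author": -10.0,
    "report": -20.0,
}


def weighted_score(predictions, weights=EXAMPLE_WEIGHTS):
    """Final Score = sum over actions of weight_i * P(action_i).

    predictions maps action name -> predicted probability; actions missing
    from predictions contribute zero.
    """
    return sum(w * predictions.get(action, 0.0) for action, w in weights.items())


liked_post = weighted_score({"favorite": 0.5, "report": 0.0})
risky_post = weighted_score({"favorite": 0.5, "report": 0.05})
# The same like probability scores lower once a report is predicted.
```
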

---

### Filtering

Filters run at two stages:

**Pre-Scoring Filters:**

| Filter | Purpose |
|--------|---------|
| `DropDuplicatesFilter` | Remove duplicate post IDs |
| `CoreDataHydrationFilter` | Remove posts that failed to hydrate core metadata |
| `AgeFilter` | Remove posts older than threshold |
| `SelfpostFilter` | Remove user's own posts |
| `RepostDeduplicationFilter` | Dedupe reposts of same content |
| `IneligibleSubscriptionFilter` | Remove paywalled content user can't access |
| `PreviouslySeenPostsFilter` | Remove posts user has already seen |
| `PreviouslyServedPostsFilter` | Remove posts already served in session |
| `MutedKeywordFilter` | Remove posts with user's muted keywords |
| `AuthorSocialgraphFilter` | Remove posts from blocked/muted authors |

**Post-Selection Filters:**

| Filter | Purpose |
|--------|---------|
| `VFFilter` | Remove posts that are deleted/spam/violence/gore etc. |
| `DedupConversationFilter` | Deduplicate multiple branches of the same conversation thread |

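
As an illustration of how a few of the pre-scoring filters might chain, here is a Python sketch covering duplicate, age, and self-post removal. The function names, the dict fields (`post_id`, `author_id`, `created_at`), and the age threshold are hypothetical; the actual filters live in the repository and are not reproduced here.

```python
def drop_duplicates(candidates):
    """Keep the first occurrence of each post ID."""
    seen, kept = set(), []
    for c in candidates:
        if c["post_id"] not in seen:
            seen.add(c["post_id"])
            kept.append(c)
    return kept


def age_filter(candidates, now, max_age_seconds):
    """Drop posts older than the threshold (threshold is an assumption)."""
    return [c for c in candidates if now - c["created_at"] <= max_age_seconds]


def self_post_filter(candidates, viewer_id):
    """Drop posts authored by the viewer."""
    return [c for c in candidates if c["author_id"] != viewer_id]


def apply_pre_scoring_filters(candidates, viewer_id, now, max_age_seconds):
    """Apply the three sketched filters in sequence."""
    candidates = drop_duplicates(candidates)
    candidates = age_filter(candidates, now, max_age_seconds)
    return self_post_filter(candidates, viewer_id)


sample = [
    {"post_id": 1, "author_id": 7, "created_at": 990.0},
    {"post_id": 1, "author_id": 7, "created_at": 990.0},  # duplicate
    {"post_id": 2, "author_id": 8, "created_at": 100.0},  # too old
    {"post_id": 3, "author_id": 5, "created_at": 995.0},  # viewer's own post
]
kept = apply_pre_scoring_filters(sample, viewer_id=5, now=1000.0, max_age_seconds=60.0)
```
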

---

## Key Design Decisions

### 1. No Hand-Engineered Features
The system relies entirely on the Grok-based transformer to learn relevance from user engagement sequences. There is no manual feature engineering for content relevance. This significantly reduces the complexity in our data pipelines and serving infrastructure.

### 2. Candidate Isolation in Ranking
During transformer inference, candidates cannot attend to each other, only to the user context. This ensures the score for a post doesn't depend on which other posts are in the batch, making scores consistent and cacheable.

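
The masking idea can be sketched as a boolean attention mask over a sequence laid out as user-context positions followed by candidate positions. The layout, the function name, and full (non-causal) attention within the context are assumptions for illustration; see `phoenix/README.md` for the actual design.

```python
def candidate_isolation_mask(n_context, n_candidates):
    """Build mask[i][j] == True when position i may attend to position j.

    Positions 0..n_context-1 are user context; the rest are candidates.
    Every position may attend to the context, and each candidate may also
    attend to itself, but never to another candidate. Context positions do
    not attend to candidates, so no candidate can influence another
    indirectly through the context.
    """
    n = n_context + n_candidates
    return [[j < n_context or i == j for j in range(n)] for i in range(n)]


mask = candidate_isolation_mask(n_context=2, n_candidates=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With two context positions and two candidates, the last two rows differ only on their own diagonal entry, which is what makes each candidate's score independent of the other candidates in the batch.
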

### 3. Hash-Based Embeddings
Both retrieval and ranking use multiple hash functions for embedding lookup.

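
The README does not say how the hashed lookups are combined, so the Python sketch below shows one common scheme as an assumption: each hash function indexes its own table of bucket vectors, and the selected rows are summed. The class name, table sizes, the BLAKE2b-based hash, and the random initialisation are all hypothetical.

```python
import hashlib
import random


def bucket(entity_id, seed, num_buckets):
    """Map an ID to a bucket using a seeded hash (illustrative choice)."""
    digest = hashlib.blake2b(
        f"{seed}:{entity_id}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") % num_buckets


class HashEmbedding:
    """Embedding lookup through several hash functions (assumed design)."""

    def __init__(self, num_hashes=2, num_buckets=1000, dim=4, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        # One table per hash function; values stand in for learned weights.
        self.tables = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_buckets)]
            for _ in range(num_hashes)
        ]

    def lookup(self, entity_id):
        # Two IDs that collide under one hash will usually differ under
        # another, so the summed vector stays distinct for most ID pairs.
        rows = [
            table[bucket(entity_id, h, self.num_buckets)]
            for h, table in enumerate(self.tables)
        ]
        return [sum(values) for values in zip(*rows)]
```

The appeal of this kind of scheme is that table size is fixed in advance, independent of how many distinct IDs appear.
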

### 4. Multi-Action Prediction
Rather than predicting a single "relevance" score, the model predicts probabilities for many actions.

### 5. Composable Pipeline Architecture
The `candidate-pipeline` crate provides a flexible framework for building recommendation pipelines with:
- Separation of pipeline execution and monitoring from business logic
- Parallel execution of independent stages and graceful error handling
- Easy addition of new sources, hydrations, filters, and scorers

---

## License

This project is licensed under the Apache License 2.0. See [LICENSE](LICENSE) for details.