Project Detail · AI/ML System

PPO vs A2C Reinforcement Learning xApp for O-RAN Resource Allocation

This project develops and compares two actor-critic reinforcement learning solutions — Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO) — for resource allocation in O-RAN. The system is designed as an end-to-end near-real-time RIC xApp that consumes network observations, optimizes scheduling decisions for mobile users, and sends control actions back to the RAN through the E2 interface. From an AI/ML engineering perspective, the project covers the full delivery chain: xApp design, RIC-aware data flow, InfluxDB / Redis metric sharing, Docker containerization, Kubernetes pod deployment, RL training and evaluation, and comparative analysis of PPO and A2C.

Algorithms: PPO vs A2C
Environment: 5 gNBs / 10 moving UEs
Deployment: Docker + Kubernetes xApp
Optimization target: Reward-driven QoE maximization

PPO vs Advantage Actor-Critic (A2C) Reinforcement Learning-based Models

RL Comparison

The learning problem is formulated as a Markov Decision Process (MDP): the xApp observes the network state, chooses a scheduling action, receives a reward, and updates its policy to maximize long-term QoE. Both A2C and PPO belong to the actor-critic family, but they differ in how they update the policy. A2C performs a direct policy-gradient update using the advantage estimate, while PPO introduces a clipped surrogate objective that makes updates more conservative and stable.

Core formulation

Policy objective
π* = arg max_π E[ Σ_t γ^t r_t ]

The agent learns a policy that maximizes expected discounted reward over time.
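
The discounted-return objective can be sketched in a few lines; the reward values below are illustrative, not from the project:

```python
# Sketch: discounted return G = sum_t gamma^t * r_t for one episode.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Accumulate from the last step backward: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative per-step rewards
print(discounted_return([1.0, 0.5, 2.0], gamma=0.9))  # 1 + 0.9*0.5 + 0.81*2
```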

Advantage estimate
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)

The critic provides the baseline used to judge whether the selected action was better than expected.

A2C actor update
L_A2C = - E[ log π_θ(a_t|s_t) · A(s_t, a_t) ]

A2C updates policy directly using the advantage-weighted policy gradient.

PPO clipped objective
L_PPO = E[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]

PPO constrains policy changes by clipping the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t), which improves optimization stability.
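
The difference between the two surrogate losses can be made concrete with a small NumPy sketch; the log-probabilities and advantages below are made-up placeholders, not values from the project's training runs:

```python
import numpy as np

def a2c_loss(log_probs, advantages):
    # L_A2C = -E[ log pi(a|s) * A(s,a) ]; advantages are treated as constants
    return -np.mean(log_probs * advantages)

def ppo_loss(log_probs, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space
    ratio = np.exp(log_probs - old_log_probs)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # L_PPO = -E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

adv = np.array([1.0, -0.5])
logp_old = np.array([-1.0, -1.0])
logp_new = np.array([-0.5, -1.6])  # a large policy change
print(a2c_loss(logp_new, adv), ppo_loss(logp_new, logp_old, adv))
```

On the large policy jump above, the clipped PPO objective caps the contribution of both samples, whereas the plain A2C gradient would follow the full step.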

Practical comparison

A2C: simpler actor-critic update, lower algorithmic complexity, and strong early learning behavior. It is a good baseline for control-oriented resource allocation.

PPO: retains the actor-critic structure but adds policy-ratio clipping, which generally improves stability and makes learning less sensitive to large policy jumps.

Why the comparison matters: in a near-RT RIC setting, convergence speed, reward stability, and robust scheduling behavior are just as important as final reward.

Observed behavior in this project: both models converge successfully, but PPO reaches the peak reward in fewer steps and maintains smoother optimization behavior.

Engineering takeaway

A2C provides a compact and effective baseline, while PPO offers a stronger deployment candidate when training stability and safer policy improvement are priorities.

System Design and O-RAN Integration

Control Architecture

The project is built around the near-RT RIC as the control platform for deploying AI-based xApps in an O-RAN system. The core engineering requirement was to build a deployable control application that could receive metrics from the RAN, process them inside the RIC application layer, and send back control messages for resource scheduling.

The architecture uses the E2 interface as the primary path between the RAN and the xApp. The KPIMON xApp collects network metrics from E2 nodes, and the RL xApp consumes those observations to generate resource-allocation actions that are returned to the RAN through control messages.

RIC-side engineering components

  • near-RT RIC: runtime platform for xApp control logic
  • E2 Termination: interface endpoint for RAN interaction
  • KPIMON: metric collection from E2 nodes
  • Control path: feedback channel for RL-selected scheduling actions
  • xApp runtime: execution layer for monitoring and closed-loop control

xApp control workflow

  • Observation intake: KPIs flow from RAN to RIC through E2
  • State construction: RL state is formed from network measurements
  • Action generation: the xApp selects resource-allocation actions
  • Control delivery: actions are sent back to the RAN through E2 control messages

O-RAN near-RT RIC architecture with RL xApp
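
The four workflow steps can be sketched as a simplified closed loop; the RIC interface, policy class, and method names below are hypothetical stubs, not the real RIC SDK or E2 API:

```python
# Simplified closed-loop sketch of the xApp workflow. All names here are
# illustrative placeholders for the actual RIC-side components.
class StubRic:
    """Stands in for the E2 termination: serves KPIs, accepts control actions."""
    def __init__(self):
        self.sent_actions = []
    def get_kpi_indication(self):
        return {"cqi": 12, "throughput_mbps": 0.8}
    def send_control(self, action):
        self.sent_actions.append(action)

class GreedyPolicy:
    """Placeholder for the trained PPO/A2C actor."""
    def select_action(self, state):
        # Toy rule: allocate more resource blocks when throughput is low.
        return {"prbs": 10 if state[1] < 1.0 else 5}

def build_state(kpis):
    return (kpis["cqi"], kpis["throughput_mbps"])

def control_loop(ric, policy, steps=3):
    for _ in range(steps):
        kpis = ric.get_kpi_indication()       # observation intake over E2
        state = build_state(kpis)             # state construction
        action = policy.select_action(state)  # action generation
        ric.send_control(action)              # control delivery via E2 control
    return ric.sent_actions

print(control_loop(StubRic(), GreedyPolicy()))
```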

Time-Series Database InfluxDB and Redis with Data Flow

Data Layer

A key systems component of the project is the RIC-side data layer, where network measurements are collected, stored, and exposed to the learning-based xApp. The KPIMON xApp receives indication messages from the RAN through the E2 interface, decodes them, and stores the extracted values in Redis. These values are then made available through InfluxDB as a time-series view of network behavior.

From an engineering perspective, this layer bridges telemetry collection, state persistence, and model consumption. Redis supports fast in-memory exchange inside the RIC workflow, while InfluxDB provides structured time-series access for metric history, analysis, and xApp-side state retrieval. This data path is important because it transforms raw RAN measurements into a usable RL observation stream.

Redis role

  • In-memory storage: fast storage of decoded KPI values inside the RIC workflow
  • Low-latency exchange: supports rapid access to fresh measurements
  • Intermediate layer: bridges metric decoding and downstream consumption
  • RIC compatibility: practical fit for xApp-side operational messaging

InfluxDB role

  • Time-series storage: organizes measurements over time for analysis
  • Metric history: keeps sequential network trends accessible
  • State support: helps construct temporally aware inputs for the RL xApp
  • Monitoring-ready: appropriate for dashboards and performance inspection

Data flow summary

Step 1 — Collection: KPIMON receives KPI indication messages from E2 nodes.
Step 2 — Decoding: ASN.1-based messages are converted into usable metric values.
Step 3 — Redis storage: decoded values are placed into a fast in-memory layer.
Step 4 — InfluxDB view: time-indexed measurements are organized for retrieval and analysis.
Step 5 — RL state usage: the xApp consumes the resulting metric stream to construct state and produce control actions.
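
The five steps above can be sketched end to end; the dicts and lists below stand in for Redis and InfluxDB, and the metric names and decode format are illustrative:

```python
# End-to-end sketch of the data flow. Plain dicts/lists stand in for Redis
# and InfluxDB; metric names and the decode format are illustrative.
redis_store = {}     # in-memory KV layer (Redis stand-in)
influx_series = []   # time-indexed measurements (InfluxDB stand-in)

def decode_indication(raw):
    # Step 2: stand-in for ASN.1 decoding -> metric dict
    cqi, rate = raw.split(",")
    return {"cqi": int(cqi), "rate_mbps": float(rate)}

def store_metrics(metrics, ts):
    redis_store.update(metrics)                    # Step 3: fast in-memory write
    influx_series.append({"time": ts, **metrics})  # Step 4: time-series view

def build_state(window=2):
    # Step 5: temporally aware state from the last `window` samples
    recent = influx_series[-window:]
    return [(p["cqi"], p["rate_mbps"]) for p in recent]

# Step 1: KPIMON would receive these indications from E2 nodes
for ts, raw in enumerate(["12,0.8", "9,0.6", "14,0.95"]):
    store_metrics(decode_indication(raw), ts)

print(build_state())
```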

Dockerization and Kubernetes Deployment

Deployment Engineering

After implementing the RL logic in Python, the application was containerized with Docker and deployed as a Kubernetes-managed pod. This project therefore covers not only model design, but also packaging, orchestration, and infrastructure-aware delivery.

Deployment flow

  • Develop the xApp in Python
  • Build the Docker image
  • Push the image to a repository
  • Create deployment descriptors
  • Deploy the pod into the Kubernetes cluster
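
The last two steps correspond to a Kubernetes deployment descriptor; the sketch below is a minimal example in which the image name, namespace, labels, and port are hypothetical placeholders, not the project's actual manifest:

```yaml
# Minimal Deployment sketch for the containerized xApp (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rl-xapp
  namespace: ricxapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rl-xapp
  template:
    metadata:
      labels:
        app: rl-xapp
    spec:
      containers:
        - name: rl-xapp
          image: example-registry/rl-xapp:0.1.0  # image pushed in the previous step
          ports:
            - containerPort: 8080                # hypothetical xApp port
```

Applying a descriptor like this with `kubectl apply -f` creates the pod on a worker node, after which the scheduler and kubelet take over its lifecycle.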

Infrastructure components

  • Master node: API server, scheduler, controller manager, etcd
  • Worker nodes: execution layer for pod-hosted xApps
  • Pods: deployable units containing application containers
  • Kubelet / kube-proxy: node-level runtime management and pod networking

Kubernetes architecture for deploying the RL xApp

Training Environment and State Design

RL Data + Environment

The xApp was trained and evaluated in a simulated RAN environment with 5 base stations (gNBs) and 10 moving UEs. The RL state includes channel requests, channel quality, achieved data rate, and fairness-related signals.

Environment parameters

  • Number of gNBs: 5
  • Number of UEs: 10
  • Frequency: 3.5 GHz
  • Bandwidth: 10 MHz
  • Max UE downlink load: 1 Mbps

State and optimization signals

  • Channel requests
  • CQI / CSI-aware link quality
  • Achieved data rate
  • Historical allocation trace
  • Scheduling decision for QoE optimization

Simulated RAN environment for PPO and A2C training
Environment snapshots showing UE positions and QoE states
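
The signals listed above can be assembled into a single observation vector per UE; the field names and normalization constants below are illustrative, not the project's exact state encoding:

```python
import numpy as np

# Sketch of packing the listed signals into one observation vector per UE.
# Field names and normalization constants are illustrative assumptions.
def ue_observation(request, cqi, rate_mbps, past_alloc):
    return np.array([
        float(request),       # channel request flag
        cqi / 15.0,           # CQI normalized to its 4-bit range
        rate_mbps / 1.0,      # rate normalized by the 1 Mbps max DL load
        np.mean(past_alloc),  # summary of the historical allocation trace
    ])

obs = ue_observation(request=1, cqi=12, rate_mbps=0.8, past_alloc=[1, 0, 1])
print(obs)
```

Keeping each feature roughly in [0, 1] is a common practical choice that helps both actor-critic variants train stably.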

Training, Inference, and Comparative Evaluation

The project compares PPO and A2C using reward and sum data-rate behavior as the main evaluation criteria. Both converge successfully, but PPO reaches the maximum reward in fewer steps and shows stronger stability.

Evaluation metrics

Reward and data rate measure QoE-aware scheduling quality.

A2C behavior

A2C performs well early but converges more slowly.

PPO behavior

PPO converges faster and shows better stability.

Average normalized reward comparison between PPO and A2C
Environment results showing reward and connectivity

Tech Stack and Deployment Summary

RL Modeling

PPO and A2C actor-critic formulations for wireless resource allocation.

RIC Integration

E2 messaging, KPIMON collection, Redis storage, and InfluxDB sharing.

Containerization

Dockerized xApp packaging for reproducible deployment.

Orchestration

Kubernetes-managed pod execution in the RIC cluster.

Data Layer

Time-series and in-memory metric handling with InfluxDB and Redis.

Evaluation

Reward and data-rate comparison to assess convergence and stability.

Keywords: PPO, A2C, Actor-Critic Reinforcement Learning, O-RAN, near-RT RIC, xApp, Kubernetes, Docker, InfluxDB, Redis, KPIMON, E2 Interface, TensorFlow, Resource Allocation, QoE Optimization, Network Automation