Participants don't connect to each other. Every participant — the user and the tm-agent — opens one WebRTC connection to the LiveKit server (the SFU), which forwards a copy of each track to everyone else.
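To make the "one connection per participant" point concrete, here is a small sketch comparing a peer-to-peer mesh with the SFU layout described above. The function names are made up for illustration, and it assumes every participant publishes the same number of tracks and subscribes to everyone else.

```python
def mesh_connections(n: int) -> int:
    """Peer-to-peer mesh: every pair of participants needs its own connection."""
    return n * (n - 1) // 2


def sfu_connections(n: int) -> int:
    """SFU: each participant opens exactly one connection, to the server."""
    return n


def sfu_forwarded_copies(n: int, tracks_per_participant: int = 1) -> int:
    """Copies the SFU sends out: each track goes to the other n - 1 participants."""
    return n * tracks_per_participant * (n - 1)


if __name__ == "__main__":
    for n in (2, 4, 10):
        print(
            f"{n} participants: mesh={mesh_connections(n)} connections, "
            f"sfu={sfu_connections(n)} connections, "
            f"sfu forwards {sfu_forwarded_copies(n)} track copies"
        )
```

With just the user and the tm-agent (n = 2), the SFU holds two connections and forwards two copies. The fan-out work lands on the server, while each client still maintains only one connection.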
Each client→SFU leg is a WebRTC connection that must punch through NAT. WebRTC uses ICE to find a working path by gathering candidates:
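STUN discovers the client's public address as seen from outside the NAT; TURN relays the media when no direct path works. Both exist purely to traverse NAT, which is what your teammate meant by "macam NAT" (roughly, "like NAT").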
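A rough sketch of what "gathering candidates" produces: each candidate is a line with an address, a type (`host` for a local address, `srflx` for the STUN-discovered one, `relay` for a TURN allocation), and a priority. The priority formula and type preferences below are the values recommended in RFC 8445; the sample candidate lines and addresses are invented for illustration, and the parser only handles the fixed leading fields.

```python
# Type preferences recommended by RFC 8445 (higher means preferred).
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


def parse_candidate(line: str) -> dict:
    """Parse the fixed leading fields of an SDP 'candidate:' attribute line."""
    fields = line.split()
    return {
        "foundation": fields[0].split(":", 1)[1],
        "component": int(fields[1]),
        "transport": fields[2].lower(),
        "priority": int(fields[3]),
        "address": fields[4],
        "port": int(fields[5]),
        "type": fields[7],  # fields[6] is the literal keyword "typ"
    }


# Invented sample candidates: local address, STUN-discovered address, TURN relay.
SAMPLE_LINES = [
    "candidate:3 1 udp 16777215 203.0.113.5 3478 typ relay raddr 198.51.100.7 rport 54321",
    "candidate:1 1 udp 2130706431 192.168.1.20 54321 typ host",
    "candidate:2 1 udp 1694498815 198.51.100.7 54321 typ srflx raddr 192.168.1.20 rport 54321",
]

if __name__ == "__main__":
    candidates = [parse_candidate(line) for line in SAMPLE_LINES]
    for cand in sorted(candidates, key=lambda c: c["priority"], reverse=True):
        print(cand["type"], cand["address"], cand["priority"])
```

Priority is only a preference order. ICE still runs connectivity checks on candidate pairs and ends up on whichever path actually works, which is why the relay candidate sits at the bottom as the fallback.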
The SFU runs as a pod on a private IP, behind a load balancer — a browser can't reach a pod IP directly (exactly your lead's point: "livekit server lives behind load balancer").
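As a quick illustration of why the pod IP is no use to a browser, the sketch below checks whether an address falls in one of the RFC 1918 private ranges. Both addresses are hypothetical: `10.42.0.17` stands in for a pod IP and `203.0.113.10` (a documentation range) stands in for the load balancer's public address.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(addr: str) -> bool:
    """True if addr is in a private range, so not routable from the public internet."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918_NETWORKS)


if __name__ == "__main__":
    for label, addr in (("pod IP", "10.42.0.17"), ("load balancer", "203.0.113.10")):
        reachable = "not reachable" if is_rfc1918(addr) else "reachable"
        print(f"{label} {addr}: {reachable} from a browser on the internet")
```

Whatever address the browser is handed for the SFU has to be one it can route to, which means the public side of the load balancer rather than the pod's private IP.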
Different participant types join the same SFU room via different paths — this is why both a TURN server and a SIP service exist: