Transcribe an audio or video file using AssemblyAI with speaker diarization. Returns the raw transcript text and speaker-labelled segments. The cost is charged to the user’s credit account (USD). Does NOT create any document or thread. The client should then pass the returned text to POST /faces//upload.
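
As a rough illustration of how a client might prepare the returned data before the upload call, the sketch below turns speaker-labelled segments into readable lines. The response shape is an assumption: the field names `text`, `segments`, and `speaker`, and the helper `format_transcript`, are hypothetical and not taken from this API's schema. No network request is made.

```python
from typing import Any


def format_transcript(response: dict[str, Any]) -> str:
    """Render speaker-labelled segments as 'Speaker X: text' lines.

    Assumes a hypothetical response shape:
    {"text": str, "segments": [{"speaker": str, "text": str}, ...]}
    Falls back to the raw transcript text when no segments are present.
    """
    segments = response.get("segments") or []
    if not segments:
        return response.get("text", "")

    lines: list[str] = []
    last_speaker = None
    for seg in segments:
        speaker = seg.get("speaker", "unknown")
        text = seg.get("text", "").strip()
        if not text:
            continue  # skip empty segments
        if lines and speaker == last_speaker:
            # Same speaker continues: append to the previous line.
            lines[-1] += " " + text
        else:
            lines.append(f"Speaker {speaker}: {text}")
            last_speaker = speaker
    return "\n".join(lines)


# Hypothetical sample response, for illustration only.
sample = {
    "text": "Hello there. Hi. How are you?",
    "segments": [
        {"speaker": "A", "text": "Hello there."},
        {"speaker": "B", "text": "Hi."},
        {"speaker": "B", "text": "How are you?"},
    ],
}
print(format_transcript(sample))
```

Merging consecutive segments from the same speaker keeps the text compact; whether the upload endpoint expects the raw text or a formatted version like this is not stated here, so treat the formatting step as optional.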