Qingli Zeng
with co-authors Sandeep Chandukala, Hai Che, Lifeng Yang
Hebrew University of Jerusalem
In live streaming, the anchor's voice is the only continuous channel of communication.
How does the way anchors speak, rather than what they show, drive viewer engagement?
global live-streaming market, growing 25% YoY
annual subscription earnings for top streamers
Twitch, YouTube Gaming, and Kick competing fiercely
isolating vocal cues from other engagement drivers
| | Immediate: Chats | Sustained: Subscriptions |
|---|---|---|
| What | Real-time messages | Paid recurring commitment |
| Nature | Impulsive, social | Deliberate, financial |
| Drives | Community vibe | Revenue, loyalty |
These are not the same thing. They likely respond to different vocal cues.
Social Approval Theory — Bundy & Pfarrer (2015)
Emotional appeals receive higher evaluations → drives sustained commitment.
Social Influence Theory — Cialdini (2006)
Persuasion effects depend on social context → drives moderation.
Together: different vocal cues map to different engagement types — and the social environment shifts these effects.
Expressive Energy: enthusiasm, emotional intensity
Stressed: tension, pressure, audible effort
Cognitive: analytical thinking, information processing
Each carries a distinct social signal.
Anchor's vocal energy — enthusiasm, loudness range, and pitch dynamics in the audio signal.
Aggregated per minute from second-level audio measurements; standardized within streamer to remove baseline differences.
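The aggregation and standardization step can be sketched as follows. This is a minimal illustration on toy data, not the authors' pipeline: the tuple layout, the function names, the use of a simple mean within each minute, and the population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_minute_means(samples):
    """Average second-level values into (streamer, stream, minute) cells.

    samples: iterable of (streamer, stream, second, value) tuples (hypothetical layout).
    """
    buckets = defaultdict(list)
    for streamer, stream, second, value in samples:
        buckets[(streamer, stream, second // 60)].append(value)
    return {key: mean(vals) for key, vals in buckets.items()}


def standardize_within_streamer(minute_values):
    """z-score each minute against its own streamer's mean and SD."""
    by_streamer = defaultdict(list)
    for (streamer, _stream, _minute), value in minute_values.items():
        by_streamer[streamer].append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_streamer.items()}
    out = {}
    for key, value in minute_values.items():
        mu, sd = stats[key[0]]
        # a streamer with no variation gets 0.0 rather than a division error
        out[key] = (value - mu) / sd if sd > 0 else 0.0
    return out


# toy second-level measurements: (streamer, stream, second, value)
samples = [
    ("a", 1, 0, 1.0), ("a", 1, 30, 3.0),    # streamer a, minute 0
    ("a", 1, 60, 4.0), ("a", 1, 90, 4.0),   # streamer a, minute 1
    ("b", 1, 0, 10.0), ("b", 1, 61, 30.0),  # streamer b, minutes 0 and 1
]
minute_energy = per_minute_means(samples)
z_energy = standardize_within_streamer(minute_energy)
```

After standardization a naturally loud streamer and a naturally quiet one sit on the same scale, so a coefficient reflects deviations from each anchor's own baseline.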
Audible vocal stress — indexed from jitter, shimmer, voice quality irregularities, and prosodic markers of tension.
Interpreted as a signal of authenticity / emotional investment — see Finding 02.
streamers, ~35% of platform viewership
of streaming content
streamer × stream × minute
Negative Binomial regression with Gaussian Copula endogeneity correction
Video dynamics, visual attributes, audio quality, lagged outcomes, anchor fixed effects.
Addresses reverse causality (viewers shape anchor's voice) without needing an external instrument.
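As a rough illustration of the instrument-free idea, a Gaussian copula correction adds a control term for each potentially endogenous regressor: the inverse standard normal CDF applied to that regressor's empirical CDF. The sketch below builds only that term. The function name, the `n + 1` scaling of ranks, and the average-rank treatment of ties are assumptions made for illustration, and the slides do not say how the authors computed it.

```python
from statistics import NormalDist


def copula_term(x):
    """Return Phi^{-1}(ECDF(x_i)) for each observation, with average ranks for ties."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over the run of tied values starting at position i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # dividing by n + 1 keeps the ECDF strictly inside (0, 1), so inv_cdf stays finite
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


# toy per-minute values of a vocal cue (illustrative numbers only)
energy = [0.2, 1.5, -0.7, 3.1, 0.9]
energy_star = copula_term(energy)
```

The resulting `energy_star` would enter the count model as an extra regressor alongside the cue itself. This approach generally relies on the endogenous regressor being non-normally distributed.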
| Vocal cue | → Chats | → Subscriptions |
|---|---|---|
| Expressive Energy | ns (β=0.0037) | + β=0.0502*** |
| Stressed | + β=0.0037*** | + β=0.0377*** |
| Cognitive | ns (β=0.0118) | + β=0.0440** |
Two distinct psychological pathways:
*** p<0.01, ** p<0.05 (baseline NB + copula, no interactions yet)
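To read these coefficients as effect sizes, one option is to convert them to percentage changes in the expected count. The sketch below assumes the Negative Binomial models use the usual log link, which the slides do not state, so that a one-unit increase in a cue multiplies the expected count by exp(β). The helper name `pct_change` is hypothetical.

```python
import math


def pct_change(beta, delta=1.0):
    """Percent change in the expected count for a `delta`-unit increase, under a log link."""
    return (math.exp(beta * delta) - 1.0) * 100.0


# subscription coefficients copied from the table above
subs_betas = {"Expressive Energy": 0.0502, "Stressed": 0.0377, "Cognitive": 0.0440}

for cue, beta in subs_betas.items():
    print(f"{cue}: {pct_change(beta):+.2f}% expected subscriptions per one-unit increase")
```

Under that assumption, each cue corresponds to a change of roughly 4 to 5 percent in expected subscriptions per one-unit increase. If the cues are standardized, a unit is one standard deviation.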
Contrary to intuition:
Interpretation. One possibility: stress may signal authenticity and emotional investment — rewarded in a media landscape saturated with polished, edited content. But the mechanism warrants further investigation.
Channel popularity (top-5 binary) reshapes both energy and stress effects:
| Interaction | β | What it means |
|---|---|---|
| Popularity × Energy → Subs | +0.0285*** | Halo amplifies energy's persuasive force on conversion |
| Popularity × Energy → Chats | −0.0177*** | Big crowds chat anyway — anchor energy matters less |
| Popularity × Stress → Chats | +0.0094*** | Stress becomes a "rallying cry" in popular channels |
| Popularity × Stress → Subs | −0.0226*** | Established audiences want polish, not tension, for paid commitment |
Pre-validation by popularity flips how each vocal cue lands.
Chatroom sentiment (NLTK on viewer messages, lagged) further conditions vocal effects:
| Interaction | β | What it means |
|---|---|---|
| Sentiment × Energy → Subs | −0.302*** | Positive room ⇒ anchor's energy adds less to conversion |
| Sentiment × Stress → Chats | −0.0090** | Positive sentiment slightly attenuates the stress→chat channel |
The social environment partially substitutes for what the anchor's voice provides.
Consistent with Social Influence Theory.
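The lagged sentiment moderator can be sketched as a within-stream shift of the per-minute panel. This is a minimal illustration, not the authors' code: the dictionary layout, the function name, and the toy sentiment scores are assumptions.

```python
def lag_within_stream(panel, lag=1):
    """Map each (stream_id, minute) to the value observed `lag` minutes earlier.

    panel: {(stream_id, minute): value}. Minutes with no earlier observation
    in the same stream get None, so lags never spill across streams.
    """
    return {(s, m): panel.get((s, m - lag)) for (s, m) in panel}


# toy per-minute chat sentiment scores (illustrative numbers only)
sentiment = {("s1", 0): 0.10, ("s1", 1): 0.40, ("s1", 2): -0.20, ("s2", 0): 0.30}

lagged_1 = lag_within_stream(sentiment, lag=1)
lagged_2 = lag_within_stream(sentiment, lag=2)
```

Using the previous minute's sentiment keeps the moderator from being shaped by the anchor's voice in the same minute. Setting `lag=2` gives a two-minute version of the same variable.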
Copula correction is necessary
Lag sensitivity — re-estimated with two-minute lag (t−2)
Other robustness (in appendix)
| Audience | Strategy | Why |
|---|---|---|
| New anchors | Match sentiment | Sentiment amplifies energy → engagement for small audiences |
| Established anchors | Differentiate | Popular channels chat themselves; reserve energy for conversion |
| All anchors | Leverage stress carefully | Stress drives engagement, but excess tension hurts subs in big channels |
Nov 2025 — Jan 2026
Twitch VODs, ~22 TB of video
AWS S3; per-minute panel structure preserved
From proprietary closed-box to open-source, fully reproducible pipeline.
| Construct | Original | New |
|---|---|---|
| Audio features | Nemesysco QA5 | openSMILE eGeMAPSv02 (88 features) |
| Video dynamics | SSIM only | SSIM + scene detection + CLIP embeddings |
| Audio quality | MOSNet | SQUIM + LUFS + voice activity + SNR |
| Linguistic controls | Chat sentiment | Chat sentiment + transcript features: speech rate, hesitations, lexical diversity, Flesch–Kincaid grade |
Transparent, replicable, finer-grained, extensible.
Factor analysis on the new wave (34,000+ minute-observations) recovers a 3-factor structure that mirrors the original:
| Factor | Top loadings | Interpretation |
|---|---|---|
| F1 | Formant amplitudes (F1, F2, F3 relative to F0) | Expressive Energy (loudness/projection) |
| F2 | Formant frequencies + unvoiced segment length | Cognitive Tones (articulation precision) |
| F3 | Alpha ratio, Hammarberg index, MFCC2 | Stressed Tones (spectral tension) |
Same 3-factor structure as Nemesysco — same constructs, transparent measurement.
| Feature | Mean | SD | What it captures |
|---|---|---|---|
| LUFS loudness | −29.8 LUFS | 5.2 | Broadcast loudness (typical streaming range) |
| VAD ratio | 0.34 | 0.27 | Anchor speaking ~1/3 of the time |
| Speech rate | 80 w/min | 59 | Conversational pace, high variance |
| Lexical diversity | 0.64 | 0.16 | Type-token ratio in transcripts |
| Intra-minute SSIM | 0.62 | 0.17 | Visual stability across seconds within minute |
| Inter-minute CLIP cosine | — | — | Semantic shift between consecutive minutes |
Distributions are well-behaved and theoretically interpretable across all 1,131 streams.
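Two of the transcript features in the table are simple enough to sketch directly. The example below computes a type-token ratio and a words-per-minute rate for one minute of transcript. The tokenizer regex, the function names, and the sample sentence are assumptions, and the real pipeline may tokenize differently.

```python
import re


def tokenize(text):
    """Lowercase word tokens; apostrophes are kept so "let's" stays one token."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text):
    """Lexical diversity: unique tokens divided by total tokens (0.0 for silence)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def speech_rate_wpm(text, duration_seconds=60.0):
    """Words per minute for a transcript segment of the given duration."""
    return len(tokenize(text)) * 60.0 / duration_seconds


# hypothetical one-minute transcript segment
minute_transcript = "okay chat let's go let's go we got this"
ttr = type_token_ratio(minute_transcript)
wpm = speech_rate_wpm(minute_transcript, duration_seconds=60.0)
```

Because the type-token ratio falls as segments get longer, computing it over fixed one-minute windows keeps values comparable across streams.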