EMAC 2026 · DIGITAL MARKETING & SOCIAL MEDIA TRACK

Anchors' Voice Characteristics and Viewer Engagement in Live Streams

Qingli Zeng with co-authors Sandeep Chandukala, Hai Che, Lifeng Yang
Hebrew University of Jerusalem

The puzzle

In live streaming, the anchor's voice is the only continuous channel of communication.

How does the way anchors speak, rather than what they show, drive viewer engagement?


Why it matters

$20B+

global live-streaming market, growing 25% YoY


6–7 figures

annual subscription earnings for top streamers

3 platforms

Twitch, YouTube Gaming, and Kick competing fiercely


0 papers

isolating vocal cues from other engagement drivers


The live-streaming ecosystem

[Figure: Twitch stream interface showing anchor face cam, gameplay video, and chat panel. Labeled parts: Anchor, Video, Game, Chats, Sponsors, Free riders.]

Two kinds of engagement

Aspect | Immediate (chats) | Sustained (subscriptions)
What | Real-time messages | Paid recurring commitment
Nature | Impulsive, social | Deliberate, financial
Drives | Community vibe | Revenue, loyalty

These are not the same thing. They likely respond to different vocal cues.


Framework

Anchors' Vocal Characteristics
  • Positivity
  • Stress
  • Cognitive
Chatroom Sentiment
Channel Popularity
Audience Engagement
  • Chats (immediate)
  • Subscriptions (sustained)
Controls
Linguistic
  • Anticipation
  • Hesitation
  • Concentration
Video
  • Dynamics
  • HSV
Audio
  • MOSNet
  • Harmonic

Theoretical framework

Social Approval Theory — Bundy & Pfarrer (2015)

Emotional appeals receive higher evaluations → drives sustained commitment.


Social Influence Theory — Cialdini (2006)

Persuasion effects depend on social context → drives moderation.

Together: different vocal cues map to different engagement types — and the social environment shifts these effects.


Three vocal dimensions

Dimension 01: Expressive Energy
enthusiasm, emotional intensity

Dimension 02: Stressed Tones
tension, pressure, audible effort

Dimension 03: Cognitive Tones
analytical thinking, information processing

Each carries a distinct social signal.

Variable definition

Energy

Anchor's vocal energy — enthusiasm, loudness range, and pitch dynamics in the audio signal.

Low energy example
Flat, low-affect delivery
High energy example
Peak excitement, intense affect

Aggregated per minute from second-level audio measurements; standardized within streamer to remove baseline differences.
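
The aggregation and within-streamer standardization described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the record layout `(streamer, second, value)` and both function names are hypothetical, and the per-minute aggregate is assumed to be a simple mean.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_minute_mean(samples):
    """Collapse second-level readings to one value per (streamer, minute).

    samples: iterable of (streamer, second, value); layout is an assumption.
    """
    buckets = defaultdict(list)
    for streamer, second, value in samples:
        buckets[(streamer, second // 60)].append(value)
    return {key: mean(vals) for key, vals in buckets.items()}


def standardize_within_streamer(minute_values):
    """z-score every minute against its own streamer's mean and SD,
    so that baseline differences between streamers drop out."""
    by_streamer = defaultdict(list)
    for (streamer, _minute), value in minute_values.items():
        by_streamer[streamer].append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_streamer.items()}

    out = {}
    for (streamer, minute), value in minute_values.items():
        mu, sd = stats[streamer]
        # A constant series has no variation to standardize; map it to 0.
        out[(streamer, minute)] = (value - mu) / sd if sd > 0 else 0.0
    return out


# Toy data: streamer "b" is much louder than "a" at baseline.
samples = [
    ("a", 0, 1.0), ("a", 30, 3.0), ("a", 60, 4.0), ("a", 90, 6.0),
    ("b", 0, 10.0), ("b", 61, 20.0),
]
minutes = per_minute_mean(samples)
z = standardize_within_streamer(minutes)
```

After standardization both streamers' quiet and loud minutes sit on the same scale, so a coefficient on the measure reads as the effect of a one-SD deviation from that anchor's own typical level.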

Variable definition

Stress Level

Audible vocal stress — indexed from jitter, shimmer, voice quality irregularities, and prosodic markers of tension.

Low stress example
Relaxed, controlled phonation
High stress example
High tension, audible effort

Interpreted as a signal of authenticity / emotional investment — see Finding 02.


Hypotheses

  1. Expressive Energy → subscriptions, not chats
  2. Stressed tones → both chats and subs (authenticity signal)
  3. Cognitive depth → subscriptions (competence signal)
  4. Channel popularity moderates (halo for validated anchors)
  5. Chatroom sentiment moderates (positive vibe = less need for anchor energy)

Data

Top 25

streamers, ~35% of platform viewership


1,000+ h

of streaming content

Modalities
  • Audio at 40,000 Hz
  • HD video
  • Chat logs (timestamped)
  • Subscription records (timestamped)
Panel structure

streamer × stream × minute


Empirical model

Negative Binomial regression with Gaussian Copula endogeneity correction

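$$\mu_{i,t} = \beta_A \cdot \mathrm{Audio}_{i,t-1} + \beta_M \cdot \mathrm{Mod}_{i,t-1} + \beta_I \cdot (\mathrm{Audio} \times \mathrm{Mod})_{i,t-1} + \beta_C \cdot \mathrm{Controls}_{i,t-1} + \lambda_i + \varepsilon_{i,t}$$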
Controls

Video dynamics, visual attributes, audio quality, lagged outcomes, anchor fixed effects.

Why copula

Addresses reverse causality (viewers shape anchor's voice) without needing an external instrument.
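
One common way to build a Gaussian copula correction term (following Park and Gupta, 2012) is to map each observation of the endogenous regressor to the standard-normal quantile of its empirical CDF value, then add that term as an extra regressor. The sketch below shows only that transformation. The slides do not say which variant was used, so the function name, the average-rank treatment of ties, and the rank/(n+1) scaling (which keeps the largest value away from probability 1) are all assumptions.

```python
from statistics import NormalDist


def copula_term(values):
    """Return Phi^{-1}(H(x)) for each x, with H the empirical CDF.

    Ties receive their average rank; ranks are divided by n + 1 so the
    probabilities stay strictly inside (0, 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n

    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


# Hypothetical standardized energy values for five minutes of one stream.
energy = [0.2, -1.1, 0.7, 0.7, 1.9]
energy_star = copula_term(energy)
```

In a full estimation, `energy_star` would enter the regression alongside the original `energy` variable. Identification in this approach relies on the endogenous regressor being non-normally distributed.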


Finding 01 — engagement is not unidimensional

Vocal cue | → Chats | → Subscriptions
Expressive Energy | ns (β=0.0037) | + (β=0.0502***)
Stressed | + (β=0.0037***) | + (β=0.0377***)
Cognitive | ns (β=0.0118) | + (β=0.0440**)

Two distinct psychological pathways: only stressed tones raise chats, while all three cues raise subscriptions.

*** p<0.01, ** p<0.05 (baseline NB + copula, no interactions yet)
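
To read these coefficients as effect sizes, one option is the usual count-model conversion exp(β) − 1. The sketch below assumes a log link, which is standard for Negative Binomial regression but is not stated on the slide, and assumes each vocal measure is on the within-streamer standardized scale, so that one unit is one standard deviation. The helper name `pct_change` is hypothetical, and only the significant baseline coefficients are included.

```python
import math


def pct_change(beta):
    """Percent change in the expected count for a one-unit increase
    in the regressor, assuming a log-link count model."""
    return (math.exp(beta) - 1.0) * 100.0


# Significant baseline coefficients as reported in the table above.
baseline = {
    ("Expressive Energy", "subscriptions"): 0.0502,
    ("Stressed", "chats"): 0.0037,
    ("Stressed", "subscriptions"): 0.0377,
    ("Cognitive", "subscriptions"): 0.0440,
}

effects = {key: pct_change(beta) for key, beta in baseline.items()}
for (cue, outcome), pct in effects.items():
    print(f"{cue} -> {outcome}: {pct:.2f}% per unit increase")
```

Under these assumptions the subscription effects are roughly 4 to 5 percent per unit, an order of magnitude larger than the stress effect on chats.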


Finding 02 — the stress paradox

Contrary to intuition:

Stressed vocal tones positively affect engagement, both chats (β=0.0037***) and subscriptions (β=0.0377***).

Interpretation. One possibility: stress may signal authenticity and emotional investment — rewarded in a media landscape saturated with polished, edited content. But the mechanism warrants further investigation.


Finding 03 — popularity moderates

Channel popularity (top-5 binary) reshapes both energy and stress effects:

Interaction | β | What it means
Popularity × Energy → Subs | +0.0285*** | Halo amplifies energy's persuasive force on conversion
Popularity × Energy → Chats | −0.0177*** | Big crowds chat anyway; anchor energy matters less
Popularity × Stress → Chats | +0.0094*** | Stress becomes a "rallying cry" in popular channels
Popularity × Stress → Subs | −0.0226*** | Established audiences want polish, not tension, for paid commitment

Pre-validation by popularity flips how each vocal cue lands.


Finding 03 — sentiment moderates

Chatroom sentiment (NLTK on viewer messages, lagged) further conditions vocal effects:

Interaction | β | What it means
Sentiment × Energy → Subs | −0.302*** | Positive room ⇒ anchor's energy adds less to conversion
Sentiment × Stress → Chats | −0.0090** | Positive sentiment slightly attenuates the stress→chat channel

The social environment partially substitutes for what the anchor's voice provides.

Consistent with Social Influence Theory.


Robustness

Copula correction is necessary


Lag sensitivity — re-estimated with two-minute lag (t−2)
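
Building the t−1 and t−2 regressors amounts to shifting each per-minute series inside its own stream, so that no value leaks across stream boundaries. A minimal sketch, assuming minutes are consecutive within a stream; the function name and the dictionary layout are hypothetical.

```python
def lag_within_stream(series, k):
    """Shift each stream's per-minute series by k minutes.

    series: {stream_id: [value at minute 0, minute 1, ...]}
    The first k minutes of a stream have no predecessor and become None.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    lagged = {}
    for stream_id, values in series.items():
        n = len(values)
        lagged[stream_id] = [None] * min(k, n) + values[: max(n - k, 0)]
    return lagged


# Hypothetical standardized energy values for two streams.
energy = {"s1": [0.1, 0.5, 0.9], "s2": [1.0]}
energy_lag1 = lag_within_stream(energy, 1)  # baseline specification (t-1)
energy_lag2 = lag_within_stream(energy, 2)  # robustness check (t-2)
```

Rows whose lagged value is `None` would be dropped before estimation, so the t−2 specification uses slightly fewer observations per stream than the t−1 one.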


Other robustness (in appendix)


Strategic recommendations

Audience | Strategy | Why
New anchors | Match sentiment | Sentiment amplifies energy → engagement for small audiences
Established anchors | Differentiate | Popular channels chat themselves; reserve energy for conversion
All anchors | Leverage stress carefully | Stress drives engagement, but excess tension hurts subs in big channels

New since submission — replication data

New wave

Nov 2025 – Jan 2026


1,138

Twitch VODs, ~22 TB of video


Storage

AWS S3; per-minute panel structure preserved

Purpose
  • Demonstrate findings generalize over time
  • Pre-empt reviewer concerns about a single time window
  • Enable response to new variable requests (e.g., visual semantics)

Methodology upgrade

From proprietary closed-box to open-source, fully reproducible pipeline.

Construct | Original | New
Audio features | Nemesysco QA5 | openSMILE eGeMAPSv02 (88 features)
Video dynamics | SSIM only | SSIM + scene detection + CLIP embeddings
Audio quality | MOSNet | SQUIM + LUFS + voice activity + SNR
Linguistic controls | Chat sentiment | + transcript: speech rate, hesitations, lexical diversity, F–K grade

Transparent, replicable, finer-grained, extensible.

Replication preview — eGeMAPS factor structure

Validating the new pipeline

Factor analysis on the new wave (34,000+ minute-observations) recovers a 3-factor structure that mirrors the original:

Factor | Top loadings | Interpretation
F1 | Formant amplitudes (F1, F2, F3 relative to F0) | Expressive Energy (loudness/projection)
F2 | Formant frequencies + unvoiced segment length | Cognitive Tones (articulation precision)
F3 | Alpha ratio, Hammarberg index, MFCC2 | Stressed Tones (spectral tension)

Same 3-factor structure as Nemesysco — same constructs, transparent measurement.

What the new pipeline measures — descriptive stats from new wave

Per-minute feature snapshot

Feature | Mean | SD | What it captures
LUFS loudness | −29.8 dB | 5.2 | Broadcast loudness (typical streaming range)
VAD ratio | 0.34 | 0.27 | Anchor speaking ~1/3 of the time
Speech rate | 80 w/min | 59 | Conversational pace, high variance
Lexical diversity | 0.64 | 0.16 | Type-token ratio in transcripts
Intra-minute SSIM | 0.62 | 0.17 | Visual stability across seconds within minute
Inter-minute CLIP cosine | | | Semantic shift between consecutive minutes

Distributions are well-behaved and theoretically interpretable across all 1,131 streams.
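
Two of the per-minute features above are simple enough to sketch directly: lexical diversity as a type-token ratio, and semantic shift as the cosine similarity between consecutive-minute embeddings. This is an illustration under stated assumptions, not the pipeline's code. The tokenizer regex and function names are hypothetical, and the short lists stand in for real CLIP embedding vectors, whose model and dimensionality the slides do not specify.

```python
import math
import re


def type_token_ratio(transcript):
    """Unique words divided by total words in one minute of transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy inputs: a repetitive minute of speech and two placeholder embeddings.
ttr = type_token_ratio("go go go left")
same_scene = cosine_similarity([1.0, 2.0], [2.0, 4.0])
new_scene = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A similarity near 1 between consecutive minutes indicates little semantic change on screen, while values closer to 0 indicate a shift in what is shown.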


Contributions

  1. Engagement is multidimensional Different psychological pathways for different behaviors.
  2. Stress is not always negative It can function as an authenticity signal in digital communication.
  3. Social Approval and Social Influence work together Main effects plus contextual moderation.
  4. A methodological template For analyzing multimodal interactions in digital settings.

Limitations & future research

Limitations
  • Top-25 streamers only; the long tail is not covered
  • Single platform (Twitch)
  • Single category (gaming)
In progress
  • Replication on 2025–26 wave (features extracted, regression next)
  • Per-minute facial semantics (CLIP)
  • Cross-category generalization (Just Chatting, IRL, music)
  • Live broadcast vs. VOD viewer differences

Thank you.

Qingli Zeng · Hebrew University of Jerusalem

qingli.zeng@mail.huji.ac.il