EMAC 2026 · DIGITAL MARKETING & SOCIAL MEDIA TRACK

Anchors' Voice Characteristics and Viewer Engagement in Live Streams

Qingli Zeng with co-authors Sandeep Chandukala, Hai Che, Lifeng Yang
Hebrew University of Jerusalem

The puzzle

In live streaming, the anchor's voice is the only continuous channel of communication.

Hands are on the game — no text chat
Face cam is small, often partially obscured
Yet creators earn millions in subscriptions

How does the way they speak, rather than what they show, drive viewer engagement?

02

Why it matters

$20B+ global live-streaming market, growing 25% YoY

6–7 figures in annual subscription earnings for top streamers

3 platforms (Twitch, YouTube Gaming, and Kick) competing fiercely

0 papers isolating vocal cues from other engagement drivers

03

The live-streaming ecosystem

Twitch stream interface showing anchor face cam, gameplay video, and chat panel


04

Two kinds of engagement

	Immediate — Chats	Sustained — Subscriptions
What	Real-time messages	Paid recurring commitment
Nature	Impulsive, social	Deliberate, financial
Drives	Community vibe	Revenue, loyalty

These are not the same thing. They likely respond to different vocal cues.

05

Framework

Anchors' Vocal Characteristics → Audience's Engagements

Anchors' Vocal Characteristics: Positivity, Stress, Cognitive
Audience's Engagements: Chats (immediate), Subscriptions (sustained)
Moderators: Chatroom Sentiment, Channel Popularity

Controls

Linguistic: Anticipation, Hesitation, Concentration
Video: Dynamics, HSV
Audio: MOSNet, Harmonic
06

Theoretical framework

Social Approval Theory — Bundy & Pfarrer (2015)

Emotional appeals receive higher evaluations → drives sustained commitment.

Social Influence Theory — Cialdini (2006)

Persuasion effects depend on social context → drives moderation.

Together: different vocal cues map to different engagement types — and the social environment shifts these effects.

07

Three vocal dimensions

Dimension 01: Expressive Energy
enthusiasm, emotional intensity

Dimension 02: Stressed Tones
tension, pressure, audible effort

Dimension 03: Cognitive Tones
analytical thinking, information processing

Each carries a distinct social signal.

08

Variable definition

Energy

Anchor's vocal energy — enthusiasm, loudness range, and pitch dynamics in the audio signal.

Low energy example

Flat, low-affect delivery

High energy example

Peak excitement, intense affect

Aggregated per minute from second-level audio measurements; standardized within streamer to remove baseline differences.
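
The aggregation and standardization step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the tuple layout, the function names, and the choice of the population standard deviation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_minute_means(samples):
    """Average second-level values into (streamer, stream, minute) cells.

    `samples` is an iterable of (streamer, stream, second, value) tuples;
    this layout is hypothetical and chosen only for illustration.
    """
    buckets = defaultdict(list)
    for streamer, stream, second, value in samples:
        buckets[(streamer, stream, second // 60)].append(value)
    return {key: mean(vals) for key, vals in buckets.items()}


def standardize_within_streamer(minute_values):
    """z-score every minute value against its own streamer's mean and SD."""
    by_streamer = defaultdict(list)
    for (streamer, _stream, _minute), value in minute_values.items():
        by_streamer[streamer].append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_streamer.items()}

    standardized = {}
    for key, value in minute_values.items():
        mu, sd = stats[key[0]]
        # A streamer with no variation gets 0.0 rather than a division error.
        standardized[key] = (value - mu) / sd if sd > 0 else 0.0
    return standardized


# Toy second-level energy readings for two hypothetical streamers.
samples = [
    ("a", "s1", 0, 1.0), ("a", "s1", 30, 3.0),
    ("a", "s1", 60, 4.0), ("a", "s1", 90, 4.0),
    ("b", "s1", 0, 10.0), ("b", "s1", 61, 30.0),
]
minute_panel = per_minute_means(samples)
z_panel = standardize_within_streamer(minute_panel)
```

After standardization, a naturally loud streamer and a naturally quiet one are both measured relative to their own baseline, so coefficients reflect within-streamer variation rather than differences between streamers.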

09

Variable definition

Stress Level

Audible vocal stress — indexed from jitter, shimmer, voice quality irregularities, and prosodic markers of tension.

Low stress example

Relaxed, controlled phonation

High stress example

High tension, audible effort

Interpreted as a signal of authenticity / emotional investment — see Finding 02.
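
For readers unfamiliar with jitter and shimmer, the sketch below shows the standard "local" definitions: cycle-to-cycle variation in pitch period (jitter) and in peak amplitude (shimmer), each divided by the mean. It illustrates the two terms only. The deck does not describe how the stress index is computed from them, and the cycle measurements here are made-up values.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by the mean.

    Applied to pitch periods this is local jitter; applied to peak
    amplitudes it is local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Hypothetical measurements for four consecutive glottal cycles.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]   # cycle durations in seconds
peak_amps = [0.50, 0.52, 0.49, 0.51]           # peak amplitude per cycle

jitter = local_perturbation(periods_s)
shimmer = local_perturbation(peak_amps)
```

A perfectly steady voice gives zero on both measures; larger values indicate less regular phonation.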

10

Hypotheses

Expressive Energy → subscriptions, not chats
Stressed tones → both chats and subs (authenticity signal)
Cognitive depth → subscriptions (competence signal)
Channel popularity moderates (halo for validated anchors)
Chatroom sentiment moderates (positive vibe = less need for anchor energy)

11

Data

Top 25 streamers, ~35% of platform viewership

1,000+ h of streaming content

Modalities

Audio at 40,000 Hz
HD video
Chat logs (timestamped)
Subscription records (timestamped)

Panel structure

streamer × stream × minute

12

Empirical model

Negative Binomial regression with Gaussian Copula endogeneity correction

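$$\mu_{i,t} = \beta_A \cdot \mathrm{Audio}_{i,t-1} + \beta_M \cdot \mathrm{Mod}_{i,t-1} + \beta_I \cdot (\mathrm{Audio} \times \mathrm{Mod}) + \beta_C \cdot \mathrm{Controls}_{i,t-1} + \lambda_i + \varepsilon_{i,t}$$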

Controls

Video dynamics, visual attributes, audio quality, lagged outcomes, anchor fixed effects.

Why copula

Addresses reverse causality (viewers shape anchor's voice) without needing an external instrument.
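
One common way to build a Gaussian copula correction term is sketched below. Each value of the potentially endogenous regressor is mapped to its empirical CDF position and then through the inverse standard normal CDF, and the result is added to the regression as an extra control. The deck does not give implementation details, so this is an assumption-laden illustration: the function name, the average-rank handling of ties, and the rank / (n + 1) scaling are choices made for the example.

```python
from statistics import NormalDist


def copula_term(x):
    """Return inverse-normal-transformed empirical CDF values of x.

    Ranks are 1-based, ties share their average rank, and ranks are divided
    by n + 1 so every probability lies strictly inside (0, 1).
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


# Hypothetical standardized energy values for three minutes.
energy = [0.3, 1.2, 0.7]
energy_copula = copula_term(energy)
```

The correction relies on the endogenous regressor being non-normally distributed. If it were normal, the added term would be nearly collinear with the regressor itself.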

13

Finding 01 — engagement is not unidimensional

Vocal cue	→ Chats	→ Subscriptions
Expressive Energy	ns (β=0.0037)	+ β=0.0502***
Stressed	+ β=0.0037***	+ β=0.0377***
Cognitive	ns (β=0.0118)	+ β=0.0440**

Two distinct psychological pathways:

Chats — impulsive, social → triggered only by audible effort
Subscriptions — deliberate, financial → require energy + effort + competence

*** p<0.01, ** p<0.05 (baseline NB + copula, no interactions yet)
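
To read the magnitudes in the table, the sketch below converts each subscription coefficient into an approximate percentage change in expected subscriptions. It assumes a log link (the usual choice for Negative Binomial regression) and that one unit of each vocal cue equals one within-streamer standard deviation. Neither assumption is stated explicitly for every variable in the deck.

```python
import math


def pct_change_per_unit(beta):
    """Percent change in the expected count for a one-unit increase,
    under a log-link count model: (exp(beta) - 1) * 100."""
    return (math.exp(beta) - 1.0) * 100.0


# Subscription coefficients copied from the table above.
subs_betas = {
    "Expressive Energy": 0.0502,
    "Stressed": 0.0377,
    "Cognitive": 0.0440,
}
subs_effects = {cue: pct_change_per_unit(b) for cue, b in subs_betas.items()}
```

Under these assumptions, a one-unit increase in Expressive Energy corresponds to a little over 5% more expected subscriptions in the following minute.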

14

Finding 02 — the stress paradox

Contrary to intuition:

Stressed vocal tones positively affect engagement, both chats (β=0.0037***) and subscriptions (β=0.0377***).

Interpretation. One possibility: stress may signal authenticity and emotional investment — rewarded in a media landscape saturated with polished, edited content. But the mechanism warrants further investigation.

15

Finding 03 — popularity moderates

Channel popularity (top-5 binary) reshapes both energy and stress effects:

Interaction	β	What it means
Popularity × Energy → Subs	+0.0285***	Halo amplifies energy's persuasive force on conversion
Popularity × Energy → Chats	−0.0177***	Big crowds chat anyway — anchor energy matters less
Popularity × Stress → Chats	+0.0094***	Stress becomes a "rallying cry" in popular channels
Popularity × Stress → Subs	−0.0226***	Established audiences want polish, not tension, for paid commitment

Pre-validation by popularity flips how each vocal cue lands.

16

Finding 03 — sentiment moderates

Chatroom sentiment (NLTK on viewer messages, lagged) further conditions vocal effects:

Interaction	β	What it means
Sentiment × Energy → Subs	−0.302***	Positive room ⇒ anchor's energy adds less to conversion
Sentiment × Stress → Chats	−0.0090**	Positive sentiment slightly attenuates the stress→chat channel

The social environment partially substitutes for what the anchor's voice provides.

Consistent with Social Influence Theory.

17

Robustness

Copula correction is necessary

Copula terms significant (p<0.05) → viewer behavior does feed back into anchor's voice
Without correction, audio coefficients would be biased

Lag sensitivity — re-estimated with two-minute lag (t−2)

Stress and energy effects: stable
Cognitive effects: attenuate
Cognitive cues may be more transient than emotional ones.
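
The lag check can be pictured as pairing each minute's outcome with the vocal measure from one or two minutes earlier, within a single stream. The sketch below shows that pairing with hypothetical helper names and toy numbers; it is not the authors' code.

```python
def lagged(series, k):
    """Shift a per-minute series back by k minutes within one stream.

    The first k positions have no earlier observation and are set to None.
    """
    if k < 0:
        raise ValueError("lag must be non-negative")
    n = len(series)
    k = min(k, n)
    return [None] * k + list(series[: n - k])


def pair_with_lag(outcome, predictor, k):
    """Return (outcome at t, predictor at t - k) pairs, dropping minutes
    that have no lagged value."""
    return [
        (y, x)
        for y, x in zip(outcome, lagged(predictor, k))
        if x is not None
    ]


# Toy stream: chat counts and a standardized stress score per minute.
chats = [5, 6, 7, 8]
stress = [0.1, 0.2, 0.3, 0.4]

pairs_lag1 = pair_with_lag(chats, stress, 1)   # main specification (t-1)
pairs_lag2 = pair_with_lag(chats, stress, 2)   # robustness check (t-2)
```

Lags should be built separately for each stream so that the last minute of one broadcast is never paired with the first minute of the next.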

Other robustness (in appendix)

Alternative copula seeds, standardization choices, anchor-specific clustering — all consistent

18

Strategic recommendations

Audience	Strategy	Why
New anchors	Match sentiment	Sentiment amplifies energy → engagement for small audiences
Established anchors	Differentiate	Popular channels chat themselves; reserve energy for conversion
All anchors	Leverage stress carefully	Stress drives engagement, but excess tension hurts subs in big channels

19

New since submission — replication data

New wave: Nov 2025 to Jan 2026

1,138 Twitch VODs, ~22 TB of video

Storage: AWS S3; per-minute panel structure preserved

Purpose

Demonstrate findings generalize over time
Pre-empt reviewer concerns about a single time window
Enable response to new variable requests (e.g., visual semantics)

20

Methodology upgrade

From proprietary closed-box to open-source, fully reproducible pipeline.

Construct	Original	New
Audio features	Nemesysco QA5	openSMILE eGeMAPSv02 (88 features)
Video dynamics	SSIM only	SSIM + scene detection + CLIP embeddings
Audio quality	MOSNet	SQUIM + LUFS + voice activity + SNR
Linguistic controls	Chat sentiment	+ transcript: speech rate, hesitations, lexical diversity, F–K grade

Transparent, replicable, finer-grained, extensible.

21

Replication preview — eGeMAPS factor structure

Validating the new pipeline

Factor analysis on the new wave (34,000+ minute-observations) recovers a 3-factor structure that mirrors the original:

Factor	Top loadings	Interpretation
F1	Formant amplitudes (F1, F2, F3 relative to F0)	Expressive Energy (loudness/projection)
F2	Formant frequencies + unvoiced segment length	Cognitive Tones (articulation precision)
F3	Alpha ratio, Hammarberg index, MFCC2	Stressed Tones (spectral tension)

Same 3-factor structure as Nemesysco — same constructs, transparent measurement.

22

What the new pipeline measures — descriptive stats from new wave

Per-minute feature snapshot

Feature	Mean	SD	What it captures
LUFS loudness	−29.8 dB	5.2	Broadcast loudness (typical streaming range)
VAD ratio	0.34	0.27	Anchor speaking ~1/3 of the time
Speech rate	80 w/min	59	Conversational pace, high variance
Lexical diversity	0.64	0.16	Type-token ratio in transcripts
Intra-minute SSIM	0.62	0.17	Visual stability across seconds within minute
Inter-minute CLIP cosine	—	—	Semantic shift between consecutive minutes

Distributions are well-behaved and theoretically interpretable across all 1,131 streams.
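
Three of the per-minute features in the table have simple definitions, sketched below: speech rate in words per minute, lexical diversity as a type-token ratio, and cosine similarity between two embedding vectors. The tokenizer and function names are assumptions made for this example. The deck does not say whether the CLIP feature is stored as a similarity or as a distance (1 minus similarity), and real CLIP embeddings would come from a separate model.

```python
import math
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple, hypothetical tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text):
    """Distinct words divided by total words (0.0 for an empty transcript)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def speech_rate_wpm(text, seconds=60.0):
    """Words per minute for a transcript segment lasting `seconds`."""
    return len(tokenize(text)) * 60.0 / seconds


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy transcript for one minute and two made-up embedding vectors.
minute_text = "go go go left"
ttr = type_token_ratio(minute_text)
wpm = speech_rate_wpm(minute_text, seconds=60.0)
similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Because the type-token ratio depends on how many words are spoken, comparing it across minutes with very different speech rates calls for some caution.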

23

Contributions

Engagement is multidimensional: different psychological pathways for different behaviors.
Stress is not always negative: it can function as an authenticity signal in digital communication.
Social Approval and Social Influence work together: main effects plus contextual moderation.
A methodological template: for analyzing multimodal interactions in digital settings.

24

Limitations & future research

Limitations

Top-25 streamers only; the long tail of smaller channels is not covered
Single platform (Twitch)
Single category (gaming)

In progress

Replication on 2025–26 wave (features extracted, regression next)
Per-minute facial semantics (CLIP)
Cross-category generalization (Just Chatting, IRL, music)
Live broadcast vs. VOD viewer differences

25

Thank you.

Qingli Zeng · Hebrew University of Jerusalem

qingli.zeng@mail.huji.ac.il
