- Author: Daniel Downey
- Date: October 24, 2020
Almost everyone who’s played a dialogue-heavy RPG has probably, at some point, had their immersion broken by a stiff-necked, blank-faced NPC delivering emotional dialogue while looking like a robot. Even in modern games, out-of-sync gestures, uncanny facial expressions, and lips that look like two pieces of toast slapping together tend to be par for the course in the average NPC conversation.
Luckily, CD Projekt Red is making sure there are no animatronic NPCs in Cyberpunk 2077. CDPR is using relatively new lip-sync technology created by Canada’s JALI Inc. to ensure NPCs are expressive and alive when they speak, with facial expressions and lip and jaw movement that accurately reflect the lines they are speaking. At SIGGRAPH 2020, Mateusz Popławski of CDPR and some of the JALI developers gave a twenty-minute presentation on JALI and its use in Cyberpunk 2077. For those of you who don’t want to watch a twenty-minute video, we’ve put together a brief but detailed overview of their presentation.
What is JALI?
In 2016, Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh published an academic paper giving an overview of their new lip-sync tech, called JALI. JALI stands for Jaw and Lip, because the JALI 3D facial rig replicates human facial movement during speech by simulating the natural movement of those two key muscle groups. While there are other aspects of speech production in humans, the jaw and the lips are the two most visible ones, and therefore crucial to a visually realistic portrayal of speech.
In brief, here’s how the tech works:
- INPUT: Audio, text transcript, and any tags (emotion, languages) are fed into JALI;
- ANALYSIS: JALI analyzes the input for volume, pitch, and phoneme timings, then applies a rule-based AI system that ties lip and jaw positions to phonemes (specific units of sound);
- OUTPUT: JALI’s output can be modified in a number of ways, chiefly by adjusting how much the lips and jaw move on each phoneme.
In other words, JALI can take a line of speech and make the model human face look as realistic as possible as it delivers it, from how big and wide the mouth opens to how the muscles around it move.
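To make the input, analysis, and output steps a little more concrete, here is a minimal Python sketch of a rule-based phoneme-to-mouth-shape mapping. It is not JALI’s actual code or API: the rule table, the numbers in it, the `Keyframe` type, the `animate` function, and the `jaw_gain` and `lip_gain` knobs are all hypothetical and exist only to illustrate the idea of tying jaw and lip positions to timed phonemes and then scaling the result.

```python
from dataclasses import dataclass

# Hypothetical rule table: phoneme -> (jaw opening, lip activity), each in 0..1.
# The values are illustrative only and are not taken from JALI.
VISEME_RULES = {
    "AA": (0.9, 0.3),  # open vowel, as in "father"
    "IY": (0.3, 0.6),  # spread lips, as in "see"
    "UW": (0.3, 0.9),  # rounded lips, as in "blue"
    "M": (0.0, 0.8),   # closed lips
    "F": (0.1, 0.5),   # lower lip against upper teeth
}
DEFAULT_SHAPE = (0.2, 0.2)  # fallback for phonemes missing from the table


@dataclass
class Keyframe:
    time: float  # seconds from the start of the line
    jaw: float   # 0 = closed, 1 = fully open
    lip: float   # 0 = relaxed, 1 = fully engaged


def animate(phonemes, jaw_gain=1.0, lip_gain=1.0):
    """Turn (phoneme, start_time, loudness) triples into mouth keyframes.

    Loudness (0..1) scales the jaw opening, so quiet speech moves the jaw
    less. The two gains stand in for the adjustable output stage.
    """
    frames = []
    for phoneme, start, loudness in phonemes:
        base_jaw, base_lip = VISEME_RULES.get(phoneme, DEFAULT_SHAPE)
        jaw = min(1.0, base_jaw * loudness * jaw_gain)
        lip = min(1.0, base_lip * lip_gain)
        frames.append(Keyframe(start, jaw, lip))
    return frames
```

Calling `animate([("M", 0.0, 0.6), ("AA", 0.1, 0.8)], jaw_gain=1.2)` would produce two keyframes, with the second opening the jaw much wider than the first. A real system would also blend between shapes over time rather than snapping from one keyframe to the next.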
Using JALI also means less time and money spent on expensive motion capture or time-consuming animation. Given Cyberpunk 2077’s giant script, this seems like an incredibly valuable method of lip-syncing the voice lines.
Once More, With Feeling
JALI doesn’t just simulate the one-to-one correspondence of lip and jaw movement to speech. It also includes a range of emotions that can be applied to a lip-sync animation, which in turn change how the animation appears. With XML-style tags, the input text transcript can be modified to mark a given line with a specific emotion. Here is the example used in the SIGGRAPH 2020 video:
<happy-50> You’re home! </happy-50> <fear-20> Wait, what? </fear-20>
<fear-130> AAAAHHHH!! </fear-130>
These emotion tags will then influence how exactly the face appears when it is lip-syncing the dialogue.
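For the curious, here is a rough Python sketch of how tags in this style could be pulled out of a transcript. This is our own illustration, not JALI’s parser: the `TAG_PATTERN` regular expression and the `parse_emotion_tags` function are hypothetical, and we are assuming the number after the emotion name is an intensity value, which the presentation example suggests but which we can’t confirm.

```python
import re

# Matches <emotion-NN> some text </emotion-NN>, where the closing tag must
# repeat the same emotion name and number as the opening tag.
TAG_PATTERN = re.compile(
    r"<(?P<emotion>[a-z]+)-(?P<level>\d+)>"
    r"\s*(?P<text>.*?)\s*"
    r"</(?P=emotion)-(?P=level)>",
    re.DOTALL,
)


def parse_emotion_tags(transcript):
    """Return a list of (emotion, level, text) tuples found in the transcript."""
    return [
        (match["emotion"], int(match["level"]), match["text"])
        for match in TAG_PATTERN.finditer(transcript)
    ]


line = (
    "<happy-50> You're home! </happy-50> <fear-20> Wait, what? </fear-20>\n"
    "<fear-130> AAAAHHHH!! </fear-130>"
)
for emotion, level, text in parse_emotion_tags(line):
    print(f"{emotion:>5} {level:>3}  {text}")
```

Each tuple could then be handed to the animation step alongside the audio, so that the same words get a different face depending on the tagged emotion.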
In addition to the obvious importance of lip and jaw movement matching speech correctly, there are also three other important pieces of how speech is represented on the speaker’s face that dramatically influence how realistic a lip-sync appears: brow movement, head movement, and the eyes.
For the brows, JALI can determine where a given sentence has strong inflection points, and have the speaker’s brow furrow at those moments. The brows are also influenced by emotions, with negative emotions furrowing and tightening the brows, and positive emotions raising and relaxing them. We move our heads (by moving our necks) while we speak as well, and the JALI researchers watched hours of video in order to effectively replicate realistic-looking neck movement during speech.
The eyes are another common sticking point in 3D animation, and the JALI team paid careful attention to how and why our eyes blink and move when they created their eye motion and blink models. Eyes blink both for maintenance and for cognitive reasons (we blink as we think), so the JALI blink model uses the audio analysis, the lexical analysis of the text transcript, and the time since the last blink to determine when the next blink should occur. Eye motion is equally complex, and has two parts. Either the eye is focused on something specific, or a statistical model is used to accurately simulate pupil motion when there isn’t a fixed point of interest.
Another advantage of the JALI tech is its use in localization. Usually, lip-sync animations are matched to one language, and everyone playing in a different language has to suffer through awful lip-syncs. Not so in Cyberpunk 2077: the JALI system has been optimized for 10 languages, so the lip-sync animations should match whatever language you choose to play the game in.
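As a rough illustration of the blink idea, here is a small Python sketch that combines "cognitive" cues with a maintenance timer. It is only a toy under our own assumptions, not JALI’s blink model: the `schedule_blinks` function, the `min_gap` and `max_gap` thresholds, and the idea of reducing the audio and text analysis to a simple list of cue times are all hypothetical.

```python
def schedule_blinks(cue_times, duration, min_gap=1.0, max_gap=5.0):
    """Pick blink times (in seconds) for a line of dialogue.

    cue_times: moments suggested by audio or text analysis, such as pauses
               or phrase boundaries (assumed to be computed elsewhere).
    duration:  length of the line in seconds.
    min_gap:   shortest allowed time between two blinks.
    max_gap:   longest time the eyes may stay open before a maintenance blink.
    """
    blinks = []
    last = 0.0  # treat the start of the line as the most recent blink
    # Keep only cues inside the line, and add the end of the line as a
    # sentinel so that a long silent tail still gets maintenance blinks.
    cues = [c for c in sorted(cue_times) if c < duration] + [duration]
    for cue in cues:
        # Maintenance: fill long stretches without cues with regular blinks.
        while cue - last > max_gap:
            last += max_gap
            blinks.append(last)
        # Cognitive: blink on the cue unless the previous blink was too recent.
        if cue < duration and cue - last >= min_gap:
            blinks.append(cue)
            last = cue
    return blinks


print(schedule_blinks([0.4, 2.0, 2.5, 9.0], duration=10.0))
```

With these made-up inputs, the cues at 0.4 and 2.5 seconds are skipped for being too close to a previous blink, and an extra maintenance blink is inserted in the long gap before the cue at 9 seconds.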
Bringing Night City to Life
What all of this means, ultimately, is that the NPCs in Cyberpunk 2077 should come across as more alive, more real, than your average NPC in a modern RPG. By using JALI’s lip-sync technology, CD Projekt Red will be able to give players the sense that they are really there, having a conversation, when they interact with the characters in-game. This, in turn, will make NPCs feel more important — letting some stiff-necked, robotic companion die as they shuffle awkwardly around corners is one thing, but letting a character die after they’ve pleaded for help (with realistically raised eyebrows) with a face that almost seemed real… well, that’s something else entirely.
What do you think? Does good lip-sync matter when you’re playing an RPG? Let us know in the comments!