Reports emerged this week that lip-reading technology could help solve crimes by deciphering what people caught on CCTV are saying. A Press Association story picked up by many news outlets announced that visual speech recognition technology developed at the University of East Anglia in Norwich can be used to determine what people are saying when the audio is too poor to hear, such as on security camera footage.
Helen Bear, from the university’s school of computing science, said the technology could be applied to a wide range of situations from criminal investigations to entertainment: “Lip-reading has been used to pinpoint words footballers have shouted in heated moments on the pitch, but is likely to be of most practical use in situations where there are high levels of noise, such as in cars or aircraft cockpits.
“Crucially, whilst there are still improvements to be made, such a system could be adapted for use for a range of purposes – for example, for people with hearing or speech impairments.”
Some sounds, such as “P” and “B”, look similar on the lips and have traditionally been hard to decipher, the researchers said. But the machine lip-reading technology can now differentiate between those sounds, giving a more accurate translation.
Co-creator Richard Harvey said: “Lip-reading is one of the most challenging problems in artificial intelligence, so it’s great to make progress on one of the trickier aspects, which is how to train machines to recognise the appearance and shape of human lips.”
According to the UEA website, current approaches to lip-reading borrow techniques that have been successful in audio speech recognition. But in lip-reading there are no phonemes; there are only gestures (sometimes called visemes), which are poorly defined.
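The many-to-one collapse from phonemes to visemes is easy to illustrate in code. The minimal Python sketch below uses a hypothetical grouping (not UEA’s actual mapping) to show how words such as “bat” and “pat” become visually indistinguishable:

```python
# Minimal sketch of a phoneme-to-viseme mapping. The groupings are
# illustrative conventions, not the mapping used in the UEA system.
PHONEME_TO_VISEME = {
    # Bilabials: lips pressed together, visually near-identical
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    # Labiodentals: lower lip against upper teeth (the "F" lip shape)
    "f": "labiodental", "v": "labiodental",
    # A vowel and a stop, just for the toy example below
    "ae": "open_vowel", "t": "alveolar",
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence to its (ambiguous) viseme sequence."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]

# "bat" and "pat" yield the same viseme sequence:
print(to_visemes(["b", "ae", "t"]))  # ['bilabial', 'open_vowel', 'alveolar']
print(to_visemes(["p", "ae", "t"]))  # identical output
```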
It seems human lip-readers scan a visual sequence looking for characteristic gestures (e.g. the “F” lip shape). These are infrequent compared to phonemes but are reasonably reliable. So, in contrast to audio recognition, where there is a dense stream of phonemes, we have a stream of unknown elements interspersed with sparse gestures. This is a classic temporal learning problem, more analogous to event detection or change-point detection than to speech recognition.
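As a rough sketch of that event-detection framing, the hypothetical Python below scans per-frame classifier confidences for sparse, reliable gestures instead of decoding every frame; the classifier, scores and threshold values are all assumptions for illustration, not part of the published work:

```python
import numpy as np

def detect_sparse_gestures(frame_scores, threshold=0.8, min_gap=5):
    """Spot sparse, reliable gesture events in a stream of per-frame
    confidence scores (from some assumed gesture classifier), rather
    than decoding densely as an audio recogniser would."""
    events, last = [], -min_gap
    for t, score in enumerate(frame_scores):
        # Fire only on confident frames, suppressing near-duplicate hits
        if score >= threshold and t - last >= min_gap:
            events.append(t)
            last = t
    return events

# Toy stream: mostly uninformative frames with two confident gestures.
scores = np.array([0.1, 0.2, 0.15, 0.9, 0.85, 0.1, 0.05, 0.1, 0.95, 0.2])
print(detect_sparse_gestures(scores))  # -> [3, 8]
```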
The project proposes to build classifiers to recognise these sparse events and hence augment or replace the HMM recogniser, improving the robustness of computer lip-reading.
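One plausible reading of “augmenting” the recogniser is sketched below under assumed interfaces: sparse gesture detections are fused into the frame-level HMM scores before decoding, anchoring the result at the reliable moments. The fusion rule and boost value are illustrative, not the project’s actual design:

```python
import numpy as np

def rescore_with_gestures(hmm_loglik, gesture_events, boost=2.0):
    """Add a log-score bonus to the detected viseme class at each event
    frame. hmm_loglik is a (T, V) array of per-frame log-likelihoods
    over V viseme classes; gesture_events maps frame index -> detected
    viseme index. Both interfaces are assumptions for illustration."""
    rescored = hmm_loglik.copy()
    for t, viseme in gesture_events.items():
        rescored[t, viseme] += boost  # anchor the decode at reliable frames
    return rescored

# Toy decode: 4 frames, 3 viseme classes, one confident detection of
# viseme 1 (say, the labiodental "F" shape) at frame 2.
loglik = np.log(np.full((4, 3), 1 / 3))
print(rescore_with_gestures(loglik, {2: 1}))
```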