Projects

Ongoing projects

SoundFutures CDT (2024–2032)

EPSRC CDT in Sustainable Sound Futures

The Sustainable Sound Futures CDT is a doctoral training centre funded by EPSRC and run in collaboration between the universities of Salford, Sheffield, Bristol and Southampton. Working with over fifty project partners from industry and government, it offers broad expertise and cutting-edge facilities for acoustics PhD training.

Clarity (2019–2025)

Challenges to Revolutionise Hearing Device Processing

Clarity is a 5-year EPSRC project run in collaboration with Cardiff University (Psychology), the University of Nottingham (Medicine) and the University of Salford (Computer Science), with the support of the Hearing Industry Research Consortium, Action on Hearing Loss, Amazon and Honda.

The project aims to transform hearing-device research by the introduction of open evaluations ("challenges") similar to those that have been the driving force in many other fields of speech technology. The project will develop the simulation tools, models, databases and listening test protocols needed to facilitate such challenges. We will develop simulators to create different listening scenarios and baseline models to predict how hearing-impaired listeners perceive speech in noise. Data will also include the results of large-scale speech-in-noise listening tests along with a comprehensive characterisation of each test subject's hearing ability. These data and tools will form a test-bed to allow other researchers to develop their own algorithms for hearing aid processing in different listening scenarios. The project will run three challenge cycles with steering from industry partners and the speech and hearing research communities.
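To make the kind of tooling described above more concrete, the minimal sketch below mixes a target utterance with noise at a chosen SNR and applies a crude audiogram-based attenuation as a stand-in for a hearing-loss simulator. It is purely illustrative and is not the project's simulation or prediction code; the function names and parameter values are invented for this example.

```python
# Minimal sketch (not the project's actual tooling): build a speech-in-noise
# scene at a chosen SNR and apply a crude audiogram-based attenuation as a
# stand-in for a hearing-loss simulator. All names here are illustrative.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio is `snr_db`, then mix."""
    noise = noise[:len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + gain * noise

def apply_audiogram(signal, fs, audiogram_freqs, thresholds_db):
    """Very rough audibility model: attenuate each FFT bin by the listener's
    hearing threshold at that frequency (interpolated from the audiogram)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    loss_db = np.interp(freqs, audiogram_freqs, thresholds_db)
    spectrum *= 10 ** (-loss_db / 20)
    return np.fft.irfft(spectrum, n=len(signal))

fs = 16000
speech = np.random.randn(fs)   # placeholder for a target utterance
noise = np.random.randn(fs)    # placeholder for cafe/babble noise
scene = mix_at_snr(speech, noise, snr_db=3.0)
perceived = apply_audiogram(scene, fs,
                            audiogram_freqs=[250, 500, 1000, 2000, 4000, 8000],
                            thresholds_db=[10, 15, 25, 40, 55, 65])
```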

Project Website: http://claritychallenge.org

Cadenza (2022–2027)

Machine Learning Challenges to Revolutionise Music Listening for People with Hearing Loss

Cadenza is a 4.5-year EPSRC project run in collaboration with the University of Salford (Computer Science), the University of Leeds and the University of Nottingham (Medicine), with the support of the BBC, Google, Logitech and Sonova AG, and with user engagement via the Royal National Institute for the Deaf (RNID).

1 in 6 people in the UK has a hearing loss, and this number will increase as the population ages. Poorer hearing makes music harder to appreciate. Picking out lyrics or melody lines is more difficult; the thrill of a musician creating a barely audible note is lost if the sound is actually inaudible, and music becomes duller as high frequencies disappear. This risks disengagement from music and the loss of the health and wellbeing benefits it creates.

The project will look at personalising music so it works better for those with a hearing loss.

The project will consider:

  1. Processing and remixing mixing desk feeds for live events or multitrack recordings.
  2. Processing of stereo recordings in the cloud or on consumer devices.
  3. Processing of music as picked up by hearing aid microphones.

The project aims to accelerate research in this area by organising a series of signal processing challenges. These challenges will grow a collaborative community that can apply its skills and knowledge to this problem area.

The project will develop the tools, databases and objective models needed to run the challenges, lowering the barriers that currently prevent many researchers from considering hearing loss. Because the music processing needs to be personalised, the data will include the results of listening tests into how real people perceive audio quality, along with a characterisation of each test subject's hearing ability. We will also develop new objective models to predict how people with a hearing loss perceive the audio quality of music. Together, these data and tools will allow researchers to develop novel algorithms.
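As a purely illustrative sketch of what "personalising music" might mean at its very simplest, the code below applies a frequency-dependent gain to a stereo signal in proportion to a listener's audiogram. It is not the project's baseline or objective model; all names and parameter values are invented.

```python
# Illustrative sketch only: boost high frequencies of a stereo music track
# in proportion to a listener's audiogram, as one crude form of the
# personalisation discussed above. Parameter choices are arbitrary.
import numpy as np

def personalise_stereo(stereo, fs, audiogram_freqs, thresholds_db, strength=0.5):
    """Apply a frequency-dependent gain to each channel.

    `strength` controls how much of the measured loss is compensated
    (full compensation is rarely desirable in practice)."""
    out = np.empty_like(stereo)
    for ch in range(stereo.shape[1]):
        spectrum = np.fft.rfft(stereo[:, ch])
        freqs = np.fft.rfftfreq(stereo.shape[0], 1 / fs)
        gain_db = strength * np.interp(freqs, audiogram_freqs, thresholds_db)
        spectrum *= 10 ** (gain_db / 20)
        out[:, ch] = np.fft.irfft(spectrum, n=stereo.shape[0])
    return out

fs = 44100
music = np.random.randn(fs * 2, 2)   # placeholder for a stereo excerpt
tuned = personalise_stereo(music, fs,
                           audiogram_freqs=[250, 1000, 4000, 8000],
                           thresholds_db=[10, 20, 45, 60])
```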

Project Website: http://cadenzachallenge.org

SLT CDT (2021–2028)

UKRI CDT in Speech and Language Technologies and their Applications

Speech and Language Technologies (SLTs) are a range of Artificial Intelligence (AI) approaches which allow computer programs or electronic devices to analyse, produce, modify or respond to spoken and written language. They enable natural interaction between people and computers, translation between all human languages, and analysis of speech and text.

Past projects

TAPAS (2017–2021)

Training Network on Automatic Processing of PAthological Speech

TAPAS is an H2020 Marie Skłodowska-Curie Innovative Training Network that will provide research opportunities for 15 PhD students (Early Stage Researchers) to study automatic processing of pathological speech. The network consists of 12 European research institutes and 9 associated partners.

The TAPAS work programme targets three key research problems:

  • Detection: We will develop speech processing techniques for early detection of conditions that impact on speech production. The outcomes will be cheap, non-invasive diagnostic tools that provide early warning of the onset of progressive conditions such as Alzheimer's and Parkinson's (a toy sketch of this detection idea follows the list).
  • Therapy: We will use newly emerging speech processing techniques to produce automated speech therapy tools. These tools will make therapy more accessible and more individually targeted. Better therapy can increase the chances of recovering intelligible speech after traumatic events such as a stroke or oral surgery.
  • Assisted Living: We will re-design current speech technology so that it works well for people with speech impairments. People with speech impairments often have other co-occurring conditions that make them reliant on carers. Speech-driven tools for assisted living are a way to allow such people to live more independently.
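The toy sketch below illustrates the general shape of a detection pipeline at its simplest: summarise each recording with a handful of acoustic statistics and train a classifier on labelled examples. The features, data and model are placeholders chosen for brevity, not the methods developed in TAPAS.

```python
# Toy sketch of the "detection" idea: summarise each recording with a few
# acoustic statistics and train a simple classifier to separate typical from
# atypical speech. Features, data and model here are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def summarise(signal, fs, frame=400, hop=160):
    """Crude per-recording features: mean/std of log frame energy and of the
    spectral centroid over 25 ms frames with a 10 ms hop (at 16 kHz)."""
    feats = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame]
        energy = np.log(np.sum(x ** 2) + 1e-10)
        spec = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(frame, 1 / fs)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-10)
        feats.append((energy, centroid))
    feats = np.array(feats)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

# Synthetic stand-ins for labelled recordings (1 = condition present).
rng = np.random.default_rng(0)
X = np.stack([summarise(rng.standard_normal(16000), 16000) for _ in range(40)])
y = rng.integers(0, 2, size=40)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))
```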

The TAPAS consortium includes clinical practitioners, academic researchers and industrial partners, with expertise spanning speech engineering, linguistics and clinical science. This rich network will train a new generation of 15 researchers, equipping them with the skills and resources necessary for lasting success.

AV-COGHEAR (2015–2018)

Towards visually-driven speech enhancement for cognitively-inspired multi-modal hearing-aid devices

AV-COGHEAR is an EPSRC-funded project conducted in collaboration with the University of Stirling. Current commercial hearing aids use a number of sophisticated enhancement techniques to try to improve the quality of speech signals. However, today's best aids fail to work well in many everyday situations. In particular, they fail in busy social situations where there are many competing speech sources, and they fail when the speaker is too far from the listener and the speech is swamped by noise. We have identified an opportunity to solve this problem by building hearing aids that can 'see'.

AV-COGHEAR aims to develop a new generation of hearing aid technology that extracts speech from noise by using a camera to see what the talker is saying. The wearer of the device will be able to focus their hearing on a target talker and the device will filter out competing sound. This ability, which is beyond that of current technology, has the potential to improve the quality of life of the millions suffering from hearing loss (over 10m in the UK alone).

The project brings together researchers with the complementary expertise necessary to make the audio-visual hearing aid possible, combining contrasting approaches to audio-visual speech enhancement developed by the Cognitive Computing group at Stirling and the Speech and Hearing Group at Sheffield. The Stirling approach uses the visual signal to filter out noise, whereas the Sheffield approach uses the visual signal to fill in 'gaps' in the speech. The MRC Institute of Hearing Research (IHR) will provide the expertise needed to evaluate the approach with real hearing-loss sufferers. Phonak AG, a leading international hearing aid manufacturer, is providing the advice and guidance necessary to maximise the potential for industrial impact.
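One way to picture the contrast between the two approaches is sketched below, under the simplifying assumption that a model can predict, from the talker's lip movements, a soft time-frequency mask indicating where the target speech dominates. Applying the mask corresponds to the noise-filtering view; treating low-mask regions as missing and reconstructing them corresponds to the gap-filling view. The mask predictor here is a random placeholder, not either group's method.

```python
# Minimal sketch of the mask-based picture of audio-visual enhancement.
# The "visually predicted" mask is a random placeholder.
import numpy as np

def stft(x, frame=512, hop=128):
    window = np.hanning(frame)
    frames = [window * x[i:i + frame] for i in range(0, len(x) - frame, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T   # (freq, time)

noisy = np.random.randn(16000)                 # placeholder noisy audio
spec = stft(noisy)

# Placeholder for a visually-driven mask predictor (values in [0, 1]).
mask = np.random.rand(*spec.shape)

filtered = mask * spec                          # "filter out noise" view
speech_estimate = np.abs(spec).mean()           # crude prior for reconstruction
filled = np.where(mask > 0.5, spec, speech_estimate)   # "fill in gaps" view
```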

DeepArt (2017–2018)

Deep learning of articulatory-based representations of dysarthric speech

DeepArt is a Google Faculty Award project targeting dysarthria, a form of disordered speech arising from poor motor control and a resulting lack of coordination of the articulatory system. At Sheffield, we have demonstrated that state-of-the-art training techniques developed for mainstream HMM/DNN speech recognition can raise baseline performance for dysarthric speech recognition.

The DeepArt project will aim to advance the state of the art by conducting research in three key areas:

  • articulatory-based representations;
  • use of synthetic training data; and
  • novel approaches to DNN-based speaker adaptive training.

INSPIRE (2012–2016)

Investigating Speech in Real Environments

INSPIRE is an FP7 Marie Curie Initial Training Network that will provide research opportunities for 13 PhD students (Early Stage Researchers) and 3 postdocs (Experienced Researchers) to study speech communication in real-world conditions. The network consists of 10 European research institutes and 7 associated partners (5 businesses and 2 academic hospitals). The senior researchers in the network are academics in computer science, engineering, psychology, linguistics, hearing science, as well as R&D scientists from leading businesses in acoustics and hearing instruments, and ENT specialists. The scientific goal of INSPIRE is to better understand how people recognise speech in real life under a wide range of conditions that are "non-optimal" relative to the controlled conditions in laboratory experiments, e.g., speech in noise, speech recognition under divided attention.

CHiME (2009–2012)

Computational Hearing in Multisource Environments

CHiME was an EPSRC-funded project that aimed to develop a framework for computational hearing in multisource environments. The approach operates by exploiting two levels of processing that combine to simultaneously separate and interpret sound sources (Barker et al. 2010). The first processing level exploits the continuity of sound source properties to clump the acoustic mixture into fragments of energy belonging to individual sources. The second processing level uses statistical models of specific sound sources to separate fragments belonging to the acoustic foreground (i.e. the 'attended' source) from fragments belonging to the background.
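A toy version of this two-level search is sketched below, assuming each fragment has already been reduced to a single feature and that the foreground and background are each described by a single Gaussian. Real systems score fragments against HMM/GMM source models over spectral features and use approximate search rather than exhaustive enumeration; everything here is illustrative.

```python
# Toy illustration of the two-level idea (not the CHiME system itself):
# search over foreground/background labellings of spectro-temporal
# "fragments" and score each labelling with simple per-source models.
from itertools import product
import numpy as np

# Each fragment summarised here by a single energy feature (illustrative).
fragments = np.array([2.1, 7.8, 8.3, 1.5, 6.9])

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Stand-in statistical models: the foreground (attended) source is loud,
# the background is quiet.
fg = dict(mean=7.0, var=1.0)
bg = dict(mean=2.0, var=1.0)

best = None
for labels in product([0, 1], repeat=len(fragments)):   # 1 = foreground
    score = sum(log_gauss(f, **(fg if lab else bg))
                for f, lab in zip(fragments, labels))
    if best is None or score > best[0]:
        best = (score, labels)

print("best labelling:", best[1])   # loud fragments end up in the foreground
```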

The project investigated and developed key aspects of the proposed two-level hearing framework:

  • statistical tracking models to represent sound source continuity;
  • approaches for combining statistical models of foreground and background sound sources;
  • approximate search techniques for decoding acoustic scenes in real time;
  • strategies for learning sound source models directly from noisy audio data.

CHiME built a demonstration system simulating a speech-driven home-automation application operating in a noisy domestic environment.

References

Barker, J., N. Ma, A. Coy, and M. Cooke. 2010. "Speech Fragment Decoding Techniques for Simultaneous Speaker Identification and Speech Recognition." Computer Speech and Language 24 (1): 94--111. https://doi.org/10.1016/j.csl.2008.05.003.

S2S (2007–2011)

Sound to Sense

Sound to Sense (S2S) was an EC-funded Marie Curie Research Training Network exploring how humans and computers understand speech. My interest within the network centres on modelling human word recognition in noise (Barker and Cooke 2007; Cooke et al. 2008).

References

Barker, J., and M. Cooke. 2007. "Modelling Speaker Intelligibility in Noise." Speech Communication 49 (5): 402--17. https://doi.org/10.1016/j.specom.2006.11.003.
Cooke, M., M. L. Garcia Lecumberri, and J. P. Barker. 2008. "The Foreign Language Cocktail Party Problem: Energetic and Informational Masking Effects in Non-Native Speech Perception." Journal of the Acoustical Society of America 123 (1): 414--27. https://doi.org/10.1121/1.2804952.

POP (2006–2009)

Perception on Purpose

POP was a three-year EC FP6 Specific Targeted Research project that combined auditory scene analysis and vision on robotic platforms. A key achievement in the audio processing was the combination of binaural source localisation techniques (Harding et al. 2006) with a spectro-temporal fragment-based sound source separation component, producing a robust sound source localisation implementation suitable for real-time audio motor control (Christensen et al. 2009). We also spent some time on the tricky problem of trying to use acoustic location cues when the ears generating the estimates are themselves moving on unpredictable and possibly unknown trajectories (Christensen and Barker 2009).

The demo below shows a custom-made audio-visual robot called Popeye that was developed as part of the project.

The project also constructed a small corpus of synchronised stereoscopic and binaural recordings (Arnaud et al. 2008), called CAVA, which is freely available for download.

References

Arnaud, E., H. Christensen, Y-C. Lu, et al. 2008. "The CAVA Corpus: Synchronised Stereoscopic and Binaural Datasets with Head Movements." ICMI '08 Proceedings of the 10th International Conference on Multimodal Interfaces (Crete, Greece), October, 109--16. https://doi.org/10.1145/1452392.1452414.
Christensen, H., and J. Barker. 2009. "Using Location Cues to Track Speaker Changes from Mobile, Binaural Microphones." Proceedings of the 10th Annual Conference of the International Speech Communication Association (Interspeech 2009) (Brighton, UK), September. https://doi.org/10.21437/Interspeech.2009-52.
Christensen, H., N. Ma, S. N. Wrigley, and J. Barker. 2009. "A Speech Fragment Approach to Localising Multiple Speakers in Reverberant Environments." Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Taipei, Taiwan), April, 4593--96. https://doi.org/10.1109/ICASSP.2009.4960653.
Harding, S., J. Barker, and G. J. Brown. 2006. "Mask Estimation for Missing Data Speech Recognition Based on Statistics of Binaural Interaction." IEEE Transactions on Audio, Speech and Language Processing 14 (1): 58--67. https://doi.org/10.1109/TSA.2005.860354.

AVASR (2004–2007)

Audio visual speech recognition in the presence of multiple speakers

This was an EPSRC project that looked at audio-visual speech recognition in 'cocktail party' conditions -- i.e. when there are several people speaking simultaneously. The work first showed that standard multistream AVASR approaches are not appropriate in these conditions (Shao and Barker 2008). The project then developed an audio-visual extension of the speech fragment decoding approach (Barker and Shao 2009) that, like humans, is able to exploit the visual signal not only for its phonetic content but also in its role as a cue for acoustic source separation. The latter role is also observed in human audio-visual speech processing, where the visual speech input can produce an 'informational masking release' leading to increased intelligibility even in conditions where the visual signal provides little or no useful phonetic content.
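For readers unfamiliar with the multistream formulation, the sketch below shows the standard weighted combination of audio and visual log-likelihoods that the project found to be inadequate in multi-speaker conditions. The scores and weights are made-up numbers for illustration only.

```python
# Sketch of standard multistream combination: per-state log-likelihoods from
# the audio and visual streams are combined with a stream-weight exponent
# (a weighted sum in the log domain). Values are illustrative.
import numpy as np

log_like_audio = np.array([-12.0, -9.5, -15.2])   # per-state audio scores
log_like_video = np.array([-4.1, -6.0, -3.8])     # per-state visual scores

def combine(la, lv, audio_weight):
    """Weighted product of stream likelihoods (sum in the log domain)."""
    return audio_weight * la + (1.0 - audio_weight) * lv

for w in (1.0, 0.7, 0.3):
    combined = combine(log_like_audio, log_like_video, w)
    print(f"audio weight {w:.1f}: best state = {int(np.argmax(combined))}")
```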

The project also partially funded the collection of the AV Grid corpus (Cooke et al. 2006), which is available for download either from the University of Sheffield or from Zenodo.

Demos of a face marker tracking tool (Barker 2005) that was built at the start of the project can be found here.

References

Barker, J. 2005. "Tracking Facial Markers with an Adaptive Marker Collocation Model." Proceedings of the 2005 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Philadelphia, PA), March, 665--68. https://doi.org/10.1109/ICASSP.2005.1415492.
Barker, J., and X. Shao. 2009. "Energetic and Informational Masking Effects in an Audio-Visual Speech Recognition System." IEEE Transactions on Audio, Speech and Language Processing 17 (3): 446--58. https://doi.org/10.1109/TASL.2008.2011534.
Cooke, M., J. Barker, S. Cunningham, and X. Shao. 2006. "An Audio-Visual Corpus for Speech Perception and Automatic Speech Recognition." Journal of the Acoustical Society of America 120 (5): 2421--24. https://doi.org/10.1121/1.2229005.
Shao, X., and J. P. Barker. 2008. "Stream Weight Estimation for Multistream Audio-Visual Speech Recognition in a Multispeaker Environment." Speech Communication 50 (4): 337--53. https://doi.org/10.1016/j.specom.2007.11.002.

Multisource (2002–2005)

Multisource decoding for speech in the presence of other sound sources

This was an EPSRC funded project that aimed "to generalise Automatic Speech Recognition decoding algorithms for natural listening conditions, where the speech to be recognised is one of many sound sources which change unpredictably in space and time". During this project we continued the development of the Speech Fragment Decoding approach (that was begun towards the end of the RESPITE project) leading to a theoretical framework published in (Barker et al. 2005). Also during this time we experimented with applications of the missing data approach to binaural conditions (Harding et al. 2006) and as a technique for handling reverberation (Palomäki et al. 2004).
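The core of the missing-data likelihood computation can be sketched very compactly, assuming a diagonal-covariance Gaussian state model and a hard reliability mask; bounded marginalisation and the full decoding machinery are omitted, and the numbers are invented.

```python
# Sketch of the missing-data likelihood: with a diagonal Gaussian state model,
# unreliable spectral components are marginalised out, so only the components
# marked reliable contribute to the score.
import numpy as np

def missing_data_loglik(obs, mask, mean, var):
    """Log-likelihood of `obs` under N(mean, diag(var)), using only the
    components where `mask` is True (reliable)."""
    obs, mean, var = obs[mask], mean[mask], var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (obs - mean) ** 2 / var)

obs = np.array([5.0, 1.0, 6.2, 0.4])          # noisy spectral features
mask = np.array([True, False, True, False])   # reliable = speech-dominated
mean = np.array([5.5, 4.0, 6.0, 3.5])         # a speech-state model
var = np.ones(4)

print(missing_data_loglik(obs, mask, mean, var))
```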

References

Barker, J., M. P. Cooke, and D. P. W. Ellis. 2005. "Decoding Speech in the Presence of Other Sources." Speech Communication 45 (1): 5--25. https://doi.org/10.1016/j.specom.2004.05.002.
Harding, S., J. Barker, and G. J. Brown. 2006. "Mask Estimation for Missing Data Speech Recognition Based on Statistics of Binaural Interaction." IEEE Transactions on Audio, Speech and Language Processing 14 (1): 58--67. https://doi.org/10.1109/TSA.2005.860354.
Palomäki, K. J., G. J. Brown, and J. Barker. 2004. "Techniques for Handling Convolutional Distortion with 'Missing Data' Automatic Speech Recognition." Speech Communication 43 (1--2): 123--42. https://doi.org/10.1016/j.specom.2004.02.005.

RESPITE (1999–2002)

Recognition of Speech by Partial Information TEchniques

Before taking up a lectureship I spent three years as a postdoc working on the EC ESPRIT-funded RESPITE project. The project focused on researching and developing new methodologies for robust Automatic Speech Recognition based on missing-data theory and multiple classification streams. During the project, soft missing-data techniques were developed (Barker, Josifovski, et al. 2000) and competitively evaluated on the Aurora speech recognition task (Barker et al. 2001). At the same time, and in collaboration with Martin Cooke and Dan Ellis, the initial ideas for what became Speech Fragment Decoding were formulated (Barker, Cooke, et al. 2000). A separate collaboration with Andrew Morris and Hervé Bourlard led to a generalisation of the missing-data approach ('soft data modelling') that is closely related to what is now known as 'uncertainty decoding' (Morris et al. 2001).
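The 'soft' variant can be sketched as a per-component interpolation between the clean-data likelihood and a marginalised term, weighted by the probability that the component is reliable. The marginalised term is crudely set to one here and all numbers are invented; this illustrates the idea rather than the published formulation.

```python
# Sketch of soft missing-data scoring: each spectral component has a
# probability p of being reliable, and its contribution interpolates between
# the clean-data likelihood and a marginalised term (a constant stand-in here).
import numpy as np

def soft_missing_loglik(obs, p_reliable, mean, var):
    present = np.exp(-0.5 * (np.log(2 * np.pi * var) + (obs - mean) ** 2 / var))
    marginal = np.ones_like(obs)   # stand-in for the integrated (missing) term
    return np.sum(np.log(p_reliable * present + (1 - p_reliable) * marginal))

obs = np.array([5.0, 1.0, 6.2, 0.4])
p_reliable = np.array([0.9, 0.2, 0.8, 0.1])   # soft mask from a noise estimate
mean = np.array([5.5, 4.0, 6.0, 3.5])
var = np.ones(4)

print(soft_missing_loglik(obs, p_reliable, mean, var))
```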

It was also during the RESPITE project that the CASA Toolkit (CTK) was developed. CTK aimed to provide a flexible and extensible software framework for the development and testing of Computational Auditory Scene Analysis (CASA) systems. The toolkit allowed auditory-based signal processing front-ends to be developed using a graphical interface (somewhat similar to Simulink). It also contained implementations of the various missing-data speech recognition algorithms that have been developed at Sheffield. The front-end processing code has largely been made redundant by MATLAB; however, we still use the CTK missing-data and speech fragment recognition code.

References

Barker, J., M. P. Cooke, and D. P. W. Ellis. 2000. "Decoding Speech in the Presence of Other Sound Sources." Proceedings of the International Conference on Spoken Language Processing (Beijing, China), October. https://doi.org/10.21437/ICSLP.2000-803.
Barker, J., M. Cooke, and P. Green. 2001. "Robust ASR Based on Clean Speech Models: An Evaluation of Missing Data Techniques for Connected Digit Recognition in Noise." Proceedings of the 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Eurospeech 2001 (Aalborg, Denmark), September, 213--16. https://doi.org/10.21437/Eurospeech.2001-76.
Barker, J., L. Josifovski, M. P. Cooke, and P. D. Green. 2000. "Soft Decisions in Missing Data Techniques for Robust Automatic Speech Recognition." Proceedings of the 6th International Conference on Spoken Language Processing (Interspeech 2000) (Beijing, China), October. https://doi.org/10.21437/ICSLP.2000-92.
Morris, A. C., J. Barker, and H. Bourlard. 2001. "From Missing Data to Maybe Useful Data: Soft Data Modelling for Noise Robust ASR." Proceedings of the Workshop on Innovation in Speech Processing (WISP 2001) (Stratford-upon-Avon, UK). https://doi.org/10.25144/18432.

SPHEAR (1998–1999)

Speech Hearing and Recognition

Prior to RESPITE, I spent a year as a postdoc at ICP in Grenoble (now known as Gipsa-Lab) working on SPHEAR, an EC Training and Mobility of Researchers network. The twin goals of the network were to achieve a better understanding of auditory information processing and to deploy this understanding in automatic speech recognition for adverse conditions. During the year I worked with Frédéric Berthommier and Jean-Luc Schwartz studying the relation between the audio and visual aspects of the speech signal (Barker and Berthommier 1999a, 1999b; Barker et al. 1998).

References

Barker, J. P., and F. Berthommier. 1999a. "Estimation of Speech Acoustics from Visual Speech Features: A Comparison of Linear and Non-Linear Models." Proceedings of the ISCA Workshop on Auditory-Visual Speech Processing (AVSP) '99 (University of California, Santa Cruz), August. https://www.isca-archive.org/avsp_1999/barker99_avsp.html.
Barker, J. P., and F. Berthommier. 1999b. "Evidence of Correlation Between Acoustic and Visual Features of Speech." Proc. ICPhS '99 (San Francisco), August. https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS1999/papers/p14_0199.pdf.
Barker, J. P., F. Berthommier, and J. L. Schwartz. 1998. "Is Primitive AV Coherence an Aid to Segment the Scene?" Proceedings of the ISCA Workshop on Auditory-Visual Speech Processing (AVSP) '98 (Sydney, Australia), November. https://www.isca-archive.org/avsp_1998/barker98_avsp.html.

SPRACH (1997–1998)

Speech Recognition Algorithms for Connectionist Hybrids

SPRACH was an ESPRIT Long Term Research programme project running from 1995 to 1998, on which I was employed for a brief six-month stint while completing my PhD thesis. I had some fun doing audio segmentation work with Steve Renals (then at Sheffield). The SPRACH project performed speech recognition on radio broadcasts using what was then called a 'hybrid MLP/HMM' recogniser, i.e. an MLP is used to estimate phone posteriors, which are then converted into likelihoods and decoded using an HMM in the usual manner. The audio segmentation work attempted to use features derived from the phone posteriors to segment the audio into regions that would be worth decoding (i.e. likely to give good ASR results) and regions that would not (i.e. either non-speech or very noisy speech) (Barker et al. 1998).
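The two steps mentioned above can be illustrated in a few lines: dividing MLP phone posteriors by the phone priors gives scaled likelihoods for HMM decoding, and the per-frame entropy of the posteriors gives a crude confidence for deciding which regions are worth decoding. The numbers and the threshold below are invented for illustration.

```python
# Sketch of the hybrid MLP/HMM bookkeeping: posteriors -> scaled likelihoods
# (divide by priors), plus an entropy-based per-frame confidence.
import numpy as np

posteriors = np.array([                 # frames x phones, rows sum to 1
    [0.85, 0.10, 0.05],                 # confident frame
    [0.40, 0.35, 0.25],                 # uncertain (noisy / non-speech?) frame
])
priors = np.array([0.5, 0.3, 0.2])      # phone priors from the training data

scaled_likelihoods = posteriors / priors                       # p(x|q) up to a constant
entropy = -np.sum(posteriors * np.log(posteriors), axis=1)     # per-frame uncertainty

decode_worthy = entropy < 0.9           # illustrative threshold
print(scaled_likelihoods, entropy, decode_worthy)
```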

References

Barker, J. P., G. Williams, and S. Renals. 1998. "Acoustic Confidence Measures for Segmenting Broadcast News." Proc. ICSLP '98 (Sydney, Australia), November. https://doi.org/10.21437/ICSLP.1998-605.

PhD Thesis (1994–1997)

Auditory organisation and speech perception

My thesis work (Barker 1998; Barker and Cooke 1997), supervised by Martin Cooke, was inspired by a then recently published paper (Remez et al., 1994) which used experiments with a particular synthetic analogue of natural speech, known as 'sine wave speech' (SWS), to apparently invalidate the auditory scene analysis (ASA) account of perception -- at least, insofar as it showed that ASA did not seem to account for the perceptual organisation of speech signals. This was a big deal at the time because it raised doubt about whether computational models of auditory scene analysis (CASA) were worth pursuing as a technology for robust speech processing. The thesis confirmed Remez's observation that listeners can be prompted to hear SWS utterances as coherent speech percepts, despite SWS seemingly lacking the acoustic 'grouping' cues that were supposedly essential for coherence under the ASA account. However, the thesis went on to demonstrate that the coherence of the sine wave speech percept is fragile -- e.g. listeners are not able to attend to individual SWS utterances when pairs of SWS utterances are presented simultaneously (the 'sine wave speech cocktail party'; Barker and Cooke 1999). Computational modelling studies indicated that, in fact, the fragility of SWS and the limited intelligibility of simultaneous sine wave speakers could be described fairly well by CASA-type models that combine bottom-up acoustic grouping rules and top-down models.
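For readers who have not met sine wave speech, the sketch below shows the basic recipe: the speech signal is replaced by a handful of sinusoids that follow the centre frequencies and amplitudes of the lower formants. The formant tracks here are invented; in practice they are estimated from a natural utterance.

```python
# Rough sketch of sine wave speech synthesis: a few sinusoids whose
# frequencies follow (hypothetical) formant tracks, summed together.
import numpy as np

fs = 16000
t = np.arange(fs) / fs                                 # 1 second

# Hypothetical time-varying tracks for three "formants" (Hz) and amplitudes.
f_tracks = [500 + 100 * np.sin(2 * np.pi * 2.0 * t),
            1500 + 300 * np.sin(2 * np.pi * 1.5 * t),
            2500 + 200 * np.sin(2 * np.pi * 1.0 * t)]
a_tracks = [1.0, 0.6, 0.3]

sws = np.zeros_like(t)
for f, a in zip(f_tracks, a_tracks):
    phase = 2 * np.pi * np.cumsum(f) / fs              # integrate frequency
    sws += a * np.sin(phase)
```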

References

Barker, J. P. 1998. "The Relationship Between Auditory Organisation and Speech Perception: Studies with Spectrally Reduced Speech." PhD thesis, University of Sheffield.
Barker, J. P., and M. P. Cooke. 1997. "Modelling the Recognition of Spectrally Reduced Speech." Proceedings of Eurospeech '97 (Rhodes, Greece), September, 2127--30. https://doi.org/10.21437/Eurospeech.1997-562.
Barker, J. P., and M. P. Cooke. 1999. "Is the Sine-Wave Speech Cocktail Party Worth Attending?" Speech Communication 27 (3--4): 159--74. https://doi.org/10.1016/S0167-6393(98)00081-8.