George Evangelopoulos

Representation learning for speech recognition

[Figure: SpeechInvariance.png]

The recognition of sound categories and speech in the human brain is remarkably robust to signal variations that preserve identity. The goal of this project is to provide a theoretical and computational framework for speech representation learning, spanning both machine learning and neuroscience. The methods are motivated by a theory of invariant representation learning, models of recognition in the cortex, and memory-based learning. We focus on the theoretical formulation and implementation of biologically plausible computational models and statistical learning algorithms, and on the development of acoustic signal representations using deep learning, hierarchical neural networks, and unsupervised representation learning.

More info: [CBMM research], [Poggio Lab, MIT], [LCSL, IIT/MIT]
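
A minimal NumPy sketch of the template-and-pooling idea behind the papers below (an illustration, not the published implementation): a signal is projected onto the stored transformations (the orbit) of a few templates, and pooled statistics of those projections form a signature that is invariant to the same transformations. The random templates, the circular-shift group, and the choice of moments are all illustrative assumptions.

    import numpy as np

    def orbit(template, n_shifts):
        # Orbit of a template under circular time shifts (the illustrative group).
        return np.stack([np.roll(template, s) for s in range(n_shifts)])

    def signature(x, templates, n_shifts, moments=(1, 2, 3)):
        # Invariant signature: for each template t and each group element g,
        # compute <x, g.t>, then pool over g with empirical absolute moments.
        # Pooling over the full group makes the features invariant to shifting x.
        feats = []
        for t in templates:
            dots = orbit(t, n_shifts) @ x
            feats.extend(np.mean(np.abs(dots) ** m) for m in moments)
        return np.array(feats)

    rng = np.random.default_rng(0)
    templates = [rng.standard_normal(64) for _ in range(5)]
    x = rng.standard_normal(64)

    s1 = signature(x, templates, n_shifts=64)
    s2 = signature(np.roll(x, 7), templates, n_shifts=64)  # shifted input
    print(np.allclose(s1, s2))  # True: the signature is shift-invariant

Because the pooled statistics depend only on the set of projections over the whole group, any shift of x permutes that set without changing the features; the same construction extends to transformation sets sampled from data, as in the orbit-set and group-convolution papers listed below.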

  • Discriminate-and-Rectify Encoders: Learning from Image Transformation Sets,
    A. Tacchetti, S. Voinea and G. Evangelopoulos,
    Center for Brains, Minds and Machines (CBMM) Memo No. 62, arXiv:1703.04775v1, Mar. 2017.
    [arXiv:1703.04775] [pdf] [bibtex]
  • Representation Learning from Orbit Sets for One-Shot Classification,
    A. Tacchetti, S. Voinea, G. Evangelopoulos and T. Poggio,
    AAAI Spring Symposium Series: Science of Intelligence, 2017.
    [AAAI] [pdf] [bibtex]
  • Discriminative Template Learning in Group-convolutional Networks for Invariant Speech Representations,
    C. Zhang, S. Voinea, G. Evangelopoulos, L. Rosasco and T. Poggio,
    Proc. INTERSPEECH 2015 - 16th Annual Conf. of the International Speech Communication Association, Dresden, Germany, Sep. 6-10, 2015.
    [ISCA] [pdf] [bibtex] [poster]
  • Learning An Invariant Speech Representation,
    G. Evangelopoulos, S. Voinea, C. Zhang, L. Rosasco and T. Poggio,
    Center for Brains, Minds and Machines (CBMM) Memo No. 22, arXiv:1406.3884v1, June 2014.
    [arXiv:1406.3884]
  • Word-level Invariant Representations From Acoustic Waveforms,
    S. Voinea, C. Zhang, G. Evangelopoulos, L. Rosasco and T. Poggio,
    Proc. INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association (ISCA Best Student Paper Award), Singapore, Sep. 14-18, 2014.
    [pdf] [bibtex]
  • Phone Classification by a Hierarchy of Invariant Representation Layers,
    C. Zhang, S. Voinea, G. Evangelopoulos, L. Rosasco and T. Poggio,
    Proc. INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association, Singapore, Sep. 14-18, 2014.
    [pdf] [bibtex]
  • A Deep Representation for Invariance and Music Classification,
    C. Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco and T. Poggio,
    Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 4-9, 2014.
    [arXiv:1404.0400] [IEEE] [bibtex] [slides]

Audio saliency

[Figure: SpeechAudio.png]

Audio signals with multiple, mixed sound sources convey both acoustic and semantic content, such as linguistic content, speaker identity and emotion (speech), genre (music), or auditory scene. We develop audio saliency models that capture the sensory prominence of sound signals using a model of non-stationary sinusoids in multiple frequency bands, whose instantaneous amplitude and frequency are estimated by efficient demodulation algorithms. A compact representation of energy and spectral content is obtained by tracking the dominant resonance component across frequency bands. Amplitude and frequency changes are important for aural saliency and auditory scene analysis, while modulations are related to temporal sound properties involved in auditory grouping and source recognition. Such models have applications in areas such as speech and audio detection, event detection, audio classification, and summarization.

More info: [NTUA research project]
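
To make the demodulation step concrete, here is a minimal NumPy sketch of Teager-energy demodulation in the spirit of the multiband modulation-energy papers below: the discrete Teager-Kaiser energy operator is applied to a band-limited signal, and a DESA-2-style energy separation recovers instantaneous amplitude and frequency estimates. The synthetic test signal and all parameter values are illustrative assumptions, not the published configuration.

    import numpy as np

    def teager(x):
        # Discrete Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1).
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    def desa2(x, eps=1e-12):
        # DESA-2-style energy separation: estimate instantaneous frequency
        # (rad/sample) and amplitude envelope from Teager energies of the signal
        # and of its symmetric difference y(n) = x(n+1) - x(n-1).
        psi_x = teager(x)
        y = x[2:] - x[:-2]
        psi_y = teager(y)
        n = min(len(psi_x) - 2, len(psi_y))      # align the two trimmed sequences
        px = np.maximum(psi_x[1:1 + n], eps)
        py = np.maximum(psi_y[:n], eps)
        omega = 0.5 * np.arccos(np.clip(1.0 - py / (2.0 * px), -1.0, 1.0))
        amp = 2.0 * px / np.sqrt(py)
        return omega, amp

    # Demo on a synthetic AM-FM tone; the estimates should track the modulations.
    fs = 8000.0
    t = np.arange(4000) / fs
    x = (1 + 0.3 * np.cos(2 * np.pi * 5 * t)) * np.cos(2 * np.pi * (800 * t + 40 * t ** 2))
    omega, amp = desa2(x)
    print("mean frequency estimate (Hz):", np.mean(omega) * fs / (2 * np.pi))

In a multiband setting, the same demodulation would run on the output of each bandpass filter, and the band with the largest mean Teager energy would be tracked as the dominant resonance, per the description above.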

  • A Saliency-Based Approach to Audio Event Detection and Summarization,
    A. Zlatintsi, P. Maragos, A. Potamianos and G. Evangelopoulos,
    Proc. 20th European Signal Processing Conf. (EUSIPCO), Bucharest, Romania, Aug. 2012.
    [pdf] [bibtex]
  • Audio-assisted Movie Dialogue Detection,
    M. Kotti, D. Ververidis, G. Evangelopoulos, I. Panagakis, C. Kotropoulos, P. Maragos and I. Pitas,
    IEEE Transactions on Circuits and Systems for Video Technology, special issue on Event Analysis in Videos, vol. 18, no. 11, pp. 1618-1627, Nov. 2008.
    [pdf] [bibtex]
  • Multiband Modulation Energy Tracking for Noisy Speech Detection,
    G. Evangelopoulos and P. Maragos,
    IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2024-2038, Nov. 2006.
    [pdf] [bibtex]
  • Speech Event Detection Using Multiband Modulation Energy,
    G. Evangelopoulos and P. Maragos,
    Proc. Interspeech 2005 - Eurospeech, Lisbon, Portugal, Sep. 4-8, 2005, pp. 685-688.
    [pdf] [bibtex] [poster]