Recent advances in object detection with deep learning algorithms such as Faster R-CNN or YOLO have led to significant improvements in the state of the art. These algorithms excel at subdividing images into regions of interest and then recognising the key elements that make up the original image.

Project goals

The goal of this project is to transpose this approach to audio. More specifically, we would like to create a 2D audio map along the x and y axes. This gives us two main goals: 1) localising audio sources and 2) recognising each audio source.
During the eNTERFACE project we will focus on two main categories of audio sources: home environment (closing doors, vacuum cleaner, blender, etc.) and crowd behaviour (applause, booing, cheering), as we aim to reuse the obtained results in two other, larger projects: IGLU and DeepSport.
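
To make the idea of a 2D audio map more concrete, here is a minimal Python sketch of one possible representation. It is only an illustration under stated assumptions, not the project's actual design: the names AudioEvent and build_audio_map, the use of metres and seconds, and the regular grid with a fixed cell size are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioEvent:
    """One localised and recognised audio event (hypothetical record layout)."""
    label: str     # recognised source, e.g. "applause"
    x: float       # position along the x axis (metres, assumed unit)
    y: float       # position along the y axis (metres, assumed unit)
    onset: float   # start time in seconds
    offset: float  # end time in seconds


def build_audio_map(events, width, height, cell=1.0):
    """Place event labels on a regular 2D grid covering width x height.

    Returns a list of rows; each cell holds the labels of the events
    located inside it. Events outside the mapped area are skipped.
    """
    cols = math.ceil(width / cell)
    rows = math.ceil(height / cell)
    grid = [[[] for _ in range(cols)] for _ in range(rows)]
    for ev in events:
        if not (0 <= ev.x < width and 0 <= ev.y < height):
            continue  # outside the mapped area
        col = int(ev.x // cell)
        row = int(ev.y // cell)
        grid[row][col].append(ev.label)
    return grid


# Illustrative values only, not measured data.
events = [
    AudioEvent("vacuum cleaner", x=1.2, y=0.5, onset=0.0, offset=3.0),
    AudioEvent("closing door", x=3.7, y=2.1, onset=1.0, offset=1.4),
]
audio_map = build_audio_map(events, width=4.0, height=3.0)
```

In this sketch, localisation would supply x and y and recognition would supply label, so the two goals stay independent: either step could be replaced without changing the map structure.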

Project Leaders

  • Gueorgui Pironkov (
  • Stéphane Dupont (

Project organisation

WP1: Collecting data
WP2: Event detection and localisation
WP3: Event classification
WP4: Optimisation of the different modules


  • Deliverable 1: Small audio database annotated with both the location and the type of each source.
  • Deliverable 2: Code for the audio scene analysis.