Audio Features for Music Emotion Recognition: A Survey

The design of meaningful audio features is a key need to advance the state-of-the-art in music emotion recognition (MER). This article presents a survey on the existing emotionally-relevant computational audio features, supported by the music psychology literature on the relations between eight musical dimensions (melody, harmony, rhythm, dynamics, tone color, expressivity, texture and form) and specific emotions. Based on this review, current gaps and needs are identified and strategies for future research on feature engineering for MER are proposed, namely ideas for computational audio features that capture elements of musical form, texture and expressivity that should be further researched. Previous MER surveys offered broad reviews, covering topics such as emotion paradigms, approaches for the collection of ground-truth data, types of MER problems and overviewing different MER systems. On the contrary, our approach is to offer a deep and specific review on one key MER problem: the design of emotionally-relevant audio features.


Music Emotion Recognition (MER) is attracting increasing interest from the Music Information Retrieval (MIR) research community. In fact, as pointed out by David Huron nearly 20 years ago, "music's preeminent functions are social and psychological", and so "the most useful retrieval indexes are those that facilitate searching in conformity with such social and psychological functions. Typically, such indexes will focus on stylistic, mood, and similarity information" [1].
Most recent studies have devoted their attention to MER problems, datasets and improved machine learning techniques, while applying already existing audio features developed in other contexts, such as speech recognition or music genre classification.
On the other hand, in a previous work [5], we argued that features specifically suited to emotion detection are needed to narrow the so-called semantic gap [12] and that their absence hinders the progress of research on MER. In that work, we designed and implemented novel acoustic features, targeting particularly music expressivity and texture, which led to a 9 percent classification improvement (F1-score). Hence, this study supports the argument that, to further advance the audio MER field, research needs to focus on what we believe is its main and most pressing problem: capturing the emotional content conveyed in music through better-designed audio features.
This perspective may well apply to most MIR problems, as pointed out in [13], where the authors affirm that "stagnation on most MIR task results is already acknowledged by MIR community". There, the first hypothesis raised is that "MIR approaches should perhaps be more musical knowledge-intensive" since, currently, mostly generic approaches are followed based on "the application of information retrieval solutions for music, without relying on musically meaningful features" [13]. As Pedro Domingos boldly states, "at the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used" [14].
State-of-the-art solutions are still unable to accurately solve simple problems, such as classification with few emotion classes (e.g., four to five). This is supported both by existing studies [5], [15] and by the small improvements observed in the 2007-2019 Music Information Retrieval Evaluation eXchange (MIREX) Audio Mood Classification task, an annual comparison of MER algorithms. There, the best algorithm achieved 69.8 percent accuracy in a task comprising five categories. Moreover, this score has remained stable for several years, which calls for methods that help break the so-called "glass ceiling" [12].
Given the crucial importance of emotionally-relevant audio features for MER, our goal in this survey is threefold: i) to summarize the most significant knowledge on the relations between music and emotion, structured according to eight musical dimensions (melody, harmony, rhythm, dynamics, tone color, expressivity, texture and form), which sets the ground to identify needs in the design of emotionally-relevant audio descriptors; ii) to review the current computational audio features that are relevant for MER, particularly the ones available in different open-access audio frameworks, e.g., Marsyas, MIR Toolbox, PsySound and Essentia; iii) to unveil possible directions for future research on the topic of feature engineering for MER (based on the above reviews and the identified research needs), as a key effort to break the glass ceiling on audio MER. Over the years, other authors have offered surveys on Music Emotion Recognition. The most recent we are aware of is the one by Yang et al., from 2018 [15]. Other reviews were published several years ago, e.g., the one from 2012 by Yang and Chen [16], or earlier, e.g., [17]. The common characteristic among all of them is that they provide broad MER reviews, tackling topics such as emotion paradigms, approaches for the collection of ground-truth data, types of MER problems (e.g., single-label, multi-label or music emotion variation detection) and overviews of different MER systems. On the contrary, rather than providing a broad but less specific survey, our approach is to offer an updated, deep and specific review of one key MER problem: the design of emotionally-relevant audio features, something that deserved only a somewhat shallow overview in the abovementioned works.
To further clarify the focus of this survey, it is important to mention that approaches based on deep learning techniques are out of the scope of this article, since the breadth of this topic would probably merit a survey in itself. Nevertheless, possible research directions on deep learning for MER are briefly discussed. For the same reason, features based on other modalities, e.g., symbolic or lyrics features, are not covered either. Regarding symbolic features, since some current approaches establish a bridge between the audio and the symbolic MER domains by integrating an audio transcription stage into the feature extraction stage (as discussed in Section 4, e.g., [5]), possible research directions on the exploitation of symbolic features on MER are also briefly discussed.
This paper is organized as follows. Section 2 overviews the relations between music and emotion, which are detailed in Section 3. There, we describe specific associations between each of the eight musical dimensions and different emotions. Section 4 reviews the existing emotionally-relevant computational audio features, organizing them by musical dimension. Section 5 discusses the gaps and needs to advance the study of audio feature design for MER and points to directions for future research. Finally, Section 6 concludes the article.

MUSIC AND EMOTION: OVERVIEW
Music has been with us since prehistoric times, serving as a language to express our emotions. This is regarded as music's primary purpose [20] and the "ultimate reason why humans engage with it" [21].
Our analysis of the relations between music and emotions is structured according to the fundamental musical dimensions usually presented in the musicology literature. Musical dimensions are typically organized into four to eight different categories (depending on the author, e.g., [22], [23]), each representing a core concept. Here, we employ an eight-category organization comprising: melody, harmony, rhythm, dynamics, tone color (or timbre), expressivity, musical texture and musical form.
The organization of these dimensions is not strict. Many musical features are somehow interconnected and may interact and touch other dimensions. Thus, it can be argued that some of them could be placed in different musical categories. In any case, through this organization, we can understand: i) where features related to emotion belong; ii) which features can be extracted from audio signals with the existing algorithms; iii) and thus, which musical dimensions may lack computational models to extract audio features relevant to emotion.
The relations between music and emotions have been debated for millennia, with associations between modes and emotions found in ancient texts from Indian, Middle Eastern (e.g., Persian) and Far Eastern (e.g., Japanese) traditions [21]. The Natya Shastra, an ancient Sanskrit Hindu text describing the performing arts, estimated to have been written somewhere between 500 B.C. and 500 A.D. [24], suggests that elements such as modes and musical forms are able to express particular emotions.
In ancient Greece, Plato advocated that "good rhythm wait upon good disposition, [. . .] the truly good and fair disposition of the character and the mind" [25]. In addition, Plato considered harmony as capable of moving the listener, arguing that both "rhythm and harmony find their way to the inmost soul and take strongest hold upon it" [25]. Aristotle supported the same ideas, stating that "rhythms and melodies contain representations of anger and mildness, and also of courage and temperance" [26], while different harmonies could range from relaxing to "violently exciting and emotional" [26].
Scientific studies focusing on the relations between music and emotions started more than a century ago. One of the early examples is a study by Hevner, in which the author evaluated the influence of musical factors such as rhythm, pitch, harmony, melody, tempo and mode on each of the eight emotion clusters she had earlier proposed [27]. Along with such studies, music psychologists have proposed different emotion paradigms (e.g., categorical or dimensional) and related taxonomies (e.g., [27], [28]).
To this day, this research problem is still far from completely solved. Nevertheless, several contemporary research works have already identified possible correlations or, in some cases, causal associations between specific musical elements and emotions. One of the most widely accepted concerns mode: major modes are frequently related to emotional states such as happiness, whereas minor modes are often associated with sadness or anger [29]. Likewise, simple, consonant harmonies are usually perceived as happy, pleasant or relaxed, while complex, dissonant harmonies relate to emotions such as excitement, tension or sadness, as they create instability in a musical piece [4]. Many other musical elements have been related to emotion, e.g., timing, dynamics, articulation, timbre, pitch, interval, melody, harmony, tonality, rhythm, mode, loudness, vibrato or musical form [4], [30].
Over the last decades, several associations have been identified, relating specific emotional responses to the musical dimensions described above. The next section details the most relevant findings in this area. For some musical elements, the research can be somewhat contradictory, which can be caused by many factors, from different research methodologies to differences in the scope of the studies (e.g., induced versus perceived emotion, population, and others). This is also a consequence of the complexity of the topic and indicates that further research is needed.
Most of the associations that we describe below pertain to music emotion perception or transmission, since most studies tackled that problem. (Emotion in music can be regarded as: i) perceived, i.e., the emotion an individual identifies when listening; ii) induced or felt, i.e., the emotional response a listener feels, which can differ from the perceived one; or iii) transmitted, i.e., the emotion that the performer or composer aimed to convey [8].) Still, some studies do not clearly state whether their findings concern perceived or induced emotion.

RELATIONS BETWEEN MUSICAL DIMENSIONS AND EMOTIONS
In this section we review the known relations between the eight musical dimensions and different emotions.

Melody and Emotion
Melody can be defined as a horizontal succession of pitches (the perceptual correlate of fundamental frequency), perceived by listeners as a single musical line. Given its central role in a musical piece, being (one of) the most memorable elements in a song, associations between melodic cues and emotions have been expected and suggested since Plato. Some of the strongest relations are found between wider melodic ranges (pitch ranges) and energetic emotions such as joy [31] or fear [32], while narrow ranges are associated with lower arousal emotions, e.g., sadness, melancholy or tranquility [32]. Other melodic elements, such as ascending versus descending melodic contours, have also been studied and related to several emotions [27]. However, some of these findings are disputed by other studies, which argue that the relation is more complex and involves interactions with other elements such as rhythm and mode [32]. These findings have been observed in cross-cultural studies, where listeners have also associated joy with simpler melodies and sadness with more complex ones [31], even when exposed to unfamiliar tonal systems. Table 1 summarizes the known relations between melody and emotion. There, ME stands for Musical Element.

Harmony and Emotion
If melody is said to be the horizontal part of music, harmony refers to its vertical aspect, i.e., the sound produced by the combination of various pitches (notes or tones) in chords.
Harmony, together with rhythm and melody, has been thought able to elicit emotions since ancient times. Consonant harmonies are usually associated with happiness, tranquility and serenity, while dissonant, complex harmonies are related to negative emotional states, e.g., tension and sadness, due to the instability they create in the piece [4].
In addition, major modes have been frequently related with positive emotions (e.g., happiness), while minor modes are linked to negative ones (e.g., sadness) [32]. Some authors, such as Cook et al., have tried to further understand this affective response to major/minor chords and resolved/unresolved chords, concluding that this emotional association is "neither due to the summation of interval effects nor simply arbitrary, learned cultural artifacts, but rather that harmony has a psychophysical basis dependent on three-tone combinations" [36].
The relations between harmony and emotion are summarized in Table 2.

Rhythm and Emotion
Rhythm represents the element of "time" in music, the patterns of long and short sounds and silences found in music.
Rhythm, together with melody and harmony, is one of the dimensions most associated with emotional expression in music. In fact, some authors consider it the most important one, e.g., [37], [38]. Rhythmic elements such as an increase in tempo (from 90 to 150 bpm) have been shown to increase (i.e., induce) happiness and surprise measures [39], while decreasing sadness. In that study, the authors used two groups of words to study different emotion types: three "basic emotions", for which listeners reported what they felt (i.e., induced emotion) on a scale of 1 to 8; and four "descriptive words" (tension, expressiveness, amusement and attractiveness), used to classify the musical piece (i.e., perceived emotion) on a scale of 1 to 5.
In addition to tempo, the rhythmic unit of a piece has also been shown to influence the emotional message of a song. As an example, variations "of the rhythm of the melody without altering the musical line, harmonics or beat" [39], such as changes from whole and half notes (theme) to eighth or sixteenth notes, as well as syncopated notes, were associated with specific emotions. Similar studies have supported the idea that rhythm somehow influences the emotional information in music, e.g., [40].
The associations between rhythm and emotion are summarized in Table 3, based on the reviews presented in [32], [33], [41], as well as the other mentioned papers.

Dynamics and Emotion
Dynamics represents the variation in loudness or softness of notes in a musical piece.
The influence of dynamics, namely loudness and loudness variations, on music emotions (both induced and perceived) has been studied by several researchers, some of whom relate them to specific emotional states. Empirically, an association of loud music (high intensity) with powerful and intense emotions such as joy, anger or tension seems logical. In contrast, soft music is mostly linked to calm, serene or sad music. Such associations have been verified by several researchers [42], [38], [43]. Variations in loudness over a musical piece have also been studied: larger variations are usually associated with more negative emotions [43], while smaller variations are associated with more positive ones [32]. Table 4 summarizes the associations between dynamics and emotion.

Tone Color and Emotion
Tone color (or timbre) is related to lower level elements and properties of the sound itself, e.g., amplitude and spectrum, essential to differentiate instruments and voices.
Several sound properties have been associated with emotional states. A rounder amplitude envelope has been associated with particular emotions [32], [38], while a sharper one gives rise to positive emotions such as happiness or surprise [32], with some authors also linking it to fear [38]. The number of harmonics has also been studied: a lower number is associated with boredom, happiness or sadness [32], while a high number of harmonics is usually related with emotions of high arousal and negative valence, e.g., anger, disgust or fear [32].
The tone color of specific instruments has also been suspected to carry emotional expression cues. In fact, composers, as well as film and marketing directors, select specific instruments to express distinct emotions. This idea has been supported by studies such as [45]. In this respect, Hailstone et al. state that "timbre (instrument identity) independently affects the perception of emotions in music after controlling for other acoustic, cognitive, and performance factors" [46]. These works highlight the importance of the spectral centroid (brightness) as a "significant component in music emotion". Moreover, spectral centroid deviation, spectral shape, attack time and even/odd harmonic ratio were all considered relevant [45].
A summary of the relations between tone color and emotion is presented in Table 5.

Expressivity and Emotion
Expressive techniques in music encompass several ornaments and features that are used by both composers (to enrich their pieces) and performers (to express their emotions at specific moments). Both aspects have been studied and related to specific emotional states. As an example, staccato articulation is normally associated with higher intensity and energy [32], mostly in negative emotions such as fear and anger [38]. On the other hand, legato is associated with softness [32] and sadness [38]. Similar research has been conducted regarding vibrato and emotion expression, observing that "singing an emotional passage influences acoustic features of vibrato when compared with isolated, sustained vowels" [48]. To assess this, classical singers were asked to sing passages of their preference containing both high and low levels of emotion. The analysis of the recordings showed significant changes in vibrato characteristics such as frequency modulation rate and extent.
Regarding emotion expression by the performer, some studies highlighted that artists typically use different ornaments, such as accentuating specific notes considered happy, whereas not doing the same for sadness [49]. In addition, Timmers and Ashley studied the use by flute and violin performers of specific ornamentations, such as trills, turns, mordents, arpeggios and others, when they intended to express one of four specific affect terms (happiness, sadness, anger and love), and how these emotions were perceived by listeners [50]. The agreement between intended and rated emotions was lowest for happiness. The performers employed more complex ornamentations for anger and the least complex for sadness. Table 6 summarizes the main relations between expressivity and emotion.

Texture and Emotion
Musical texture refers to the way the rhythmic, melodic and harmonic information produced by musical instruments and voices is combined in a musical composition. It is thus related to the combination and relations between the musical lines or layers (one or more instruments with the same role) in a song.
Fewer studies have been conducted regarding musical texture and emotions, and some of them contain contradictory results. In one of the oldest studies, Kastner and Crowder evaluated the emotional differences between monophonic (melody only) and homophonic textures (melody with block-chord accompaniment), as rated by children aged three to twelve. In that study, the unaccompanied (monophonic) version was rated as more positive [51]. A similar result was observed by Webster and Weir, where non-harmonized melodies were considered happier [52]. However, further studies attempting to replicate Kastner and Crowder's findings observed exactly the opposite result: not only children but also adult subjects considered monophonic sounds less happy than accompanied ones [53], [54]. A possible explanation for these contradictory results lies in the different versions of "dense textures" used in each study [55]: very basic/simple chords and a single instrument were used in the studies observing negative emotions, while the others used more complex (and thus denser) accompaniments taken from published songbooks. These differences may greatly influence other musical dimensions (e.g., harmony), making it harder to correctly compare the results.
Polyphonic textures, containing several voices, have also been explored recently, suggesting that music with a higher number of voices is perceived as more positive. Such musical excerpts were rated as "sounding more happy, less sad, less lonely, and more proud" [55].
Although further studies are required to better understand exactly how musical texture influences emotion, the existing ones have demonstrated that it can indeed influence emotion in music either directly or by interacting with other features such as tempo and mode [55]. Table 7 summarizes the associations between musical texture and emotions.

Form and Emotion
Musical form or musical structure refers to the overall structure of a musical piece and describes the layout of a composition as divided into sections.
Some studies have investigated possible relations between musical form and emotion. It seems that forms with lower complexity are associated with positive emotions [56] such as relaxation, joy or peace [31]. On the contrary, higher complexity forms usually result in more negative emotions such as sadness [31], which can be higher in arousal (e.g., aggressive) or lower (e.g., melancholy) depending on the dynamism (high or low, respectively) [56].
Some researchers explored the relation between emotion and form by changing the order of sections (in classical music) but no relevant results were obtained [57], [58].
The few associations found between musical form and emotions are presented in Table 8.

Interactions Between Musical Dimensions
As described in the previous sections, each musical element may influence distinct emotional expressions. In fact, emotional content in music is not defined exclusively by a single element but is built by the merging and interaction of several factors. Beyond studying associations concerning musical dimensions and emotions independently, these interactions between several musical dimensions and the associated emotional responses have also been studied and reviewed, e.g., [59], [60].
Such works unveil interesting indirect relations and interactions regarding the variation of specific elements and the corresponding emotional changes, as well as possible interactions between elements, resulting in different emotional states. One example is the interaction between tempo and mode [60]: high tempo and minor mode result in high arousal only, while the same high tempo, but with major mode, results in high arousal and positive valence. Several other authors have studied possible interactions, such as mode and tempo [37], the influence of pitch height, intensity and tempo on valence [42], the influence of rhythm, melodic contour and melodic progression on happy music [32], or interactions between tempo, texture and mode [52].

COMPUTATIONAL AUDIO FEATURES IN MER
In general terms, a feature is a characteristic part of something. Features help to distinguish one thing from another, by providing the essential descriptive primitives by which individual objects or works may be identified [61].
In musical terms, features may be characteristic of a musical work, of a movement, of a composer, of a very specific musical dimension, of a genre, and so forth. As Huron states, "what constitutes a feature depends on the scope of [...]" [61]. For illustration, features can be employed to represent any aspect that is relevant to the identification of a song, from the chords to abstract statistics regarding physical aspects of the sound wave, rhythm information and others. In summary, the goal of feature extraction is to reduce the information of songs to descriptors that can accurately describe them [15].
Over the last decades, several algorithms have been proposed to extract information from audio signals. These features have been developed to solve a myriad of problems, from speech recognition to content-based retrieval, indexing and fingerprinting. More recently, a few works studied how the human perception of music characteristics (e.g., tempo) correlates with these audio descriptors, e.g., [62], [63]. It was observed that some features, "in particular those related to loudness, timbre, harmony, and rhythm show high correlations with perceived emotions" [63]. Still, such studies are usually carried out with small datasets or specific genres, and further research is needed.
Nowadays, most of these feature extraction algorithms are implemented in state-of-the-art audio frameworks, commonly employed in most MIR studies. In this survey, we have reviewed the emotionally-relevant features from four common audio frameworks (Marsyas [64], MIR Toolbox (MIR TB) [65], PsySound [66] and Essentia [67]), based on the identified relations between different musical elements and emotions (as discussed in Section 3). The available frameworks vary greatly in many aspects, from user-friendliness to computational efficiency or the number of implemented algorithms. Some are aimed at research, requiring specific environments (e.g., MATLAB), while others are designed with performance in mind and are more suited to industrial use. For an in-depth review, see [68], [69].
In the following, we catalog the audio features that have been proposed in the literature over the years and are now available in these frameworks, organizing them according to the musical dimensions to which they are closest. Besides these frameworks, which implement most of the state-of-the-art audio features, in a recent work we have contributed a set of emotionally-relevant audio features, comprising mostly expressivity and musical texture features [5]. As will be discussed, such features are noticeably underrepresented in the discussed audio frameworks.
Many of the features are extracted repeatedly for smaller excerpts (analysis windows) of the entire audio clip, returning series of data. These frame-level features are usually integrated using statistical moments such as mean, standard deviation, skewness and kurtosis, as well as maximum and minimum, before being used with machine learning techniques.
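For concreteness, the following Python sketch (using numpy and scipy rather than any of the surveyed frameworks; the function name and the stand-in per-frame series are ours) illustrates this common integration step:

import numpy as np
from scipy.stats import skew, kurtosis

def integrate_frames(frame_values):
    """Collapse a per-frame feature series into the six summary statistics
    commonly used before classification."""
    x = np.asarray(frame_values, dtype=float)
    return {"mean": x.mean(), "std": x.std(), "skewness": skew(x),
            "kurtosis": kurtosis(x), "max": x.max(), "min": x.min()}

# Stand-in for a real per-frame series (e.g., spectral centroid per frame).
frame_centroids = np.random.rand(500) * 4000.0
song_level_features = integrate_frames(frame_centroids)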

Melody Features
In this section we describe the audio features that capture information primarily related with melody and its components, as summarized in Table 9.

Pitch
Pitch represents the perceived fundamental frequency of a sound. It is one of the three major auditory attributes of sounds, along with loudness and timbre. Pitch (as an audio feature) typically refers to the fundamental frequency of a monophonic sound signal and can be calculated using various techniques. One common method to calculate pitch, employed in Marsyas, the MIR Toolbox and Essentia, is the YIN algorithm [70]. PsySound3 also implements the SWIPE and SWIPE′ algorithms proposed by Camacho [71].
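As an illustrative example (using librosa, which is not one of the four surveyed frameworks, and a hypothetical file name), a frame-level YIN pitch track can be obtained as follows:

import librosa

# Track the F0 of a (hypothetical) monophonic excerpt with YIN,
# restricted to a plausible melodic range.
y, sr = librosa.load("melody.wav", mono=True)
f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                 fmax=librosa.note_to_hz("C6"), sr=sr)
# f0 holds one fundamental frequency estimate (in Hz) per analysis frame.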

Virtual Pitch Features
Ernst Terhardt et al. proposed an algorithm to extract virtual pitch, which is related to the psychoacoustics and modelling of the perceived pitch [72]. The PsySound3 framework implements this algorithm.

Pitch Salience
The perception of pitch, in particular its salience, is a complex phenomenon that can be roughly explained as how noticeable (i.e., strongly marked) the pitch of a sound is, and it was proposed as a quick measure of tone sensation. Pure tones have an average pitch salience value close to 0, whereas sounds containing several harmonics in the spectrum have higher salience values. Different approaches have been proposed to extract pitch salience, e.g., [73]. This feature is present in the MIR Toolbox and Essentia.

Predominant Melody F0
Several authors have proposed algorithms to estimate the fundamental frequency (F0) of the predominant melody in both polyphonic and monophonic music audio signals. This is still an open research problem, and most of the audio frameworks do not include polyphonic audio melody F0 extractors. Still, some of the proposed algorithms are nowadays available as separate tools, e.g., the MELODIA algorithm [73], provided in Essentia.
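For illustration, a minimal sketch using Essentia's Python bindings is given below; the file name is hypothetical and the parameter values mirror commonly used defaults, which should be checked against the Essentia documentation:

import essentia.standard as es

# Estimate the predominant melody F0 contour of a polyphonic recording
# with Essentia's MELODIA-based extractor (a sketch, defaults mostly kept).
audio = es.MonoLoader(filename="song.wav", sampleRate=44100)()
pitch, confidence = es.PredominantPitchMelodia(frameSize=2048, hopSize=128)(audio)
# "pitch" holds one F0 value per hop (0 Hz where no melody is detected).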

Pitch Content
Tzanetakis proposed a set of simple features extracted from folded and unfolded pitch histograms (in the folded pitch histogram all notes are mapped to a single octave) to describe pitch information [64]: FA0: amplitude of the maximum peak of the folded histogram; UP0: period of the maximum peak of the unfolded histogram; IPO1: pitch interval between the two most prominent peaks of the folded histogram; SUM: the overall sum of the histogram. Although the author described these features in his PhD thesis about the Marsyas framework, the current documentation seems to ignore them. Due to this, we could not confirm whether the framework is still able to extract them.
The features proposed in [5], described in the following subsections, rely on the melody transcription of the original audio waveform. In that work, the authors employed the works by Salamon and Gómez [73] and Dressler [74] to estimate predominant fundamental frequencies as well as saliences. The resulting pitch trajectories are then segmented into individual MIDI notes following the work by Paiva et al. [75].

Register Distribution
This class of features, proposed in [5], indicates how the notes of the predominant melody are distributed across different pitch ranges. Each instrument and voice type has a different range, and these ranges often overlap. The authors selected six classes, based on the vocal categories and ranges for non-classical singers. The resulting metrics are the percentage of MIDI note values in the melody that are in each of the following registers: Soprano (C4-C6), Mezzo-soprano (A3-A5), Contralto (F3-E5), Tenor (B2-A4), Baritone (G2-F4) and Bass (E2-E4).
In addition, the authors also propose the register distribution per second, as the ratio of the sum of the duration of notes with a specific pitch range (e.g., soprano) to the total duration of all notes.
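A minimal sketch of how such a register distribution could be computed from a list of transcribed MIDI note numbers is shown below (our own illustrative code, not the authors' implementation; the example melody is hypothetical):

import numpy as np

# Ranges from the text, expressed as MIDI note numbers (C4 = 60, etc.).
REGISTERS = {
    "soprano":       (60, 84),  # C4-C6
    "mezzo_soprano": (57, 81),  # A3-A5
    "contralto":     (53, 76),  # F3-E5
    "tenor":         (47, 69),  # B2-A4
    "baritone":      (43, 65),  # G2-F4
    "bass":          (40, 64),  # E2-E4
}

def register_distribution(midi_notes):
    """Fraction of melody notes falling inside each (overlapping) register."""
    notes = np.asarray(midi_notes)
    return {name: float(np.mean((notes >= lo) & (notes <= hi)))
            for name, (lo, hi) in REGISTERS.items()}

print(register_distribution([62, 64, 65, 67, 69, 71, 72]))  # hypothetical melody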

Note Smoothness (NS) Statistics
Also related to the characteristics of the melody contour, Panda et al. [5] propose a note smoothness feature as an indicator of how close consecutive notes are, i.e., how smooth the melody contour is. To this end, the difference between consecutive notes (MIDI values) is computed. The usual six statistics are also calculated.

Ratios of Pitch Transitions
In Panda et al. [5], the abovementioned extracted MIDI note values are used to build a sequence of transitions to higher, lower and equal notes.
The obtained sequence marking transitions to higher, equal or lower notes is summarized in several metrics, namely: Transitions to Higher Pitch Notes Ratio, Transitions to Lower Pitch Notes Ratio and Transitions to Equal Pitch Notes Ratio. There, the ratio of the number of specific transitions to the total number of transitions is computed.
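These ratios amount to simple counts over the sign of successive pitch differences, as the following illustrative sketch (our own code, not the authors' implementation) shows:

import numpy as np

def pitch_transition_ratios(midi_notes):
    """Ratios of transitions to higher, lower and equal pitch notes."""
    steps = np.diff(np.asarray(midi_notes))
    n = max(len(steps), 1)
    return {"to_higher": float(np.sum(steps > 0)) / n,
            "to_lower":  float(np.sum(steps < 0)) / n,
            "to_equal":  float(np.sum(steps == 0)) / n}

print(pitch_transition_ratios([60, 62, 62, 59, 64]))  # hypothetical melody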

Harmony Features
In this section we describe the audio features that capture information primarily related with harmony and its components (Table 10).

Inharmonicity
The inharmonicity feature is based on the number of partials that are not multiples of the fundamental frequency. Inharmonicity influences the timbre perception of a given sound. One approach to compute this was proposed by Peeters et al. [76] and is implemented in Essentia. The MIR Toolbox measures inharmonicity as the amount of energy outside the ideal harmonic series, which presupposes that there is only one fundamental frequency [65].

Chromagram
The chromagram (implemented in Marsyas, the MIR Toolbox and Essentia) is used to estimate the energy distribution along pitch classes. It consists of a 12-dimensional vector, one dimension per semitone pitch class (from A to G#), with the respective intensities in each of these classes based on the spectral peaks of the waveform. It is also known as Harmonic Pitch Class Profile (HPCP) [65].
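As an illustration (using librosa rather than the frameworks above; the file name is hypothetical), a 12-bin chromagram can be obtained directly from the STFT:

import librosa

# Energy per pitch class, one 12-dimensional vector per analysis frame.
y, sr = librosa.load("song.wav")
chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # shape: (12, n_frames)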

Chord Sequence
Extracting chords from an audio signal is a complex task, for which researchers have yet to propose robust solutions. The existing methods to estimate this are still experimental, based on pitch class profiles [77]. Essentia implements an algorithm based on this research, able to compute the sequence of chords in a song. This algorithm calculates the best matching major or minor triad and outputs the result as a string (e.g., A#, Bm, G#m, C). The existing implementation is marked as experimental and requires further work before being usable.

Tuning Frequency
The tuning frequency (available in Essentia) is an estimation of the exact frequency (in Hz) on which a song is tuned. It is used as an intermediary step for HPCP calculation and key estimation but can also be applied for classification tasks such as western vs. non-western music [77].

Key Strength
Key strength (MIR Toolbox and Essentia) consists in the computation of the strength of each possible key candidate to be the key of a given song (e.g., outputting scores between 0 and 1, or -1 to 1). The algorithm is based on the cross-correlation of the chromagram [77].

Key and Key Clarity
These features (implemented in the MIR Toolbox and Essentia) give a broad estimation of tonal center positions and their respective clarity. This is based on peak picking in the key strength curve. There, the best key(s) is given by the peak abscissa, while the key clarity is the key strength associated with the best keys, i.e., the key ordinate [65].

Tonal Centroid Vector (6 dimensions)
In the MIR Toolbox, the tonal centroid is represented as a 6-dimensional feature vector. It corresponds to a projection of the chords along circles of fifths, of minor thirds and of major thirds [78]. It is based on the Harmonic Network, or Tonnetz, a planar representation of pitch relations, where pitch classes having close harmonic relations, such as fifths and major/minor thirds, have smaller Euclidean distances on the plane. By calculating the Euclidean distance between the tonal centroid vectors of successive analysis frames, the algorithm detects harmonic changes, such as chord boundaries, in musical audio.
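An equivalent 6-dimensional tonal centroid representation can be obtained, for illustration, with librosa (not one of the surveyed frameworks; hypothetical file name):

import librosa

# Project chroma onto the 6-D tonal centroid space (circles of fifths,
# minor thirds and major thirds), following Harte et al.
y, sr = librosa.load("song.wav")
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)  # shape: (6, n_frames)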

Harmonic Change Detection Function
PsySound3 implements the Harmonic Change Detection Function (HCDF), which is a method for detecting changes in the harmonic content of musical audio signals proposed by Harte et al. [78]. It can be interpreted as the flux of the tonal centroid, as in the distance between the harmonic regions of successive frames [78].

Sharpness
Sound can be subjectively rated on a scale from dull to sharp, and sharpness algorithms attempt to model this. PsySound3 implements several such algorithms [66], which are essentially weighted centroids of specific loudness.

Modality
Several algorithms exist to estimate modality, i.e., major versus minor, returning either a binary label, e.g., major/minor, or a numerical value, e.g., between -1 (minor) and 1 (major) [65]. In the MIR Toolbox and Essentia, the typical strategies use the estimated strength of each key and consist of: i) the difference between the strength of the strongest major and minor keys; or ii) the sum of all the differences between each major key and its relative minor key pair.

Rhythm Features
In this section we describe the audio features that capture information primarily related with rhythm and its components (Table 11).

Beat Spectrum
The beat spectrum (MIR Toolbox) has been proposed as a measure of acoustic self-similarity as a function of time lag. It is computed from the similarity matrix, obtained by comparing the spectral similarity between all possible pairs of frames from the original audio signal [79].

Beat Location
Different beat tracking algorithms have been proposed over time. These algorithms estimate the beat locations in an input signal. The Essentia framework implements several beat tracker and rhythm extractor functions, e.g., the multifeature beat tracker, which extends the idea of measuring the level of agreement between a committee of different beat tracking algorithms on a song-by-song basis [80]. Marsyas implements IBT, a real-time/off-line tempo induction and beat tracking system based on a competing multi-agent strategy that considers parallel hypotheses regarding tempo and beats [81].

Onset Time
Another way of determining the tempo is based on the computation of an onset detection curve, showing the successive bursts of energy corresponding to the successive pulses [76]. Peak picking is automatically performed on the onset detection curve, to show the estimated positions of the note onsets. This feature is provided by the MIR Toolbox and Essentia. In the case of the MIR Toolbox, its onset function is able to return the onset times using any of the following options: peaks, valleys, attack phase and release phase [65].

Event Density
This feature (MIR Toolbox) estimates the "speed" of a song based on the average number of events in a given time window, i.e., the number of note onsets per second [65].

Average Duration of Events
In the MIR Toolbox, the duration of events (e.g., a note) can also be estimated from its envelope. One possible approach to estimate this was proposed by Peeters et al. [76]. It consists in detecting attack and release phases and measuring the time (in seconds) between them when the amplitude is at least 40 percent of the maximum.

Tempo
Several algorithms have been proposed to estimate tempo [19], i.e., the speed of a given musical piece, usually indicated in beats per minute (BPM). This feature, available in Marsyas, the MIR Toolbox and Essentia through different alternative algorithms, is typically estimated by detecting periodicities from the onset detection curve [65].
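For illustration, the following librosa sketch (hypothetical file name; librosa is not one of the four surveyed frameworks) estimates a global tempo and the beat locations from an onset strength curve, mirroring the general strategy described above:

import librosa

# Estimate tempo (BPM) and beat positions from the onset detection curve.
y, sr = librosa.load("song.wav")
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)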

Predominant Local Pulse (PLP) Novelty Curves
Grosche and Müller introduced a mid-level representation for capturing the dominant tempo and the predominant local pulse, even in music with weak non-percussive note onsets and strongly fluctuating tempo [82]. Essentia implements this feature. While the PLP curve does not represent high-level information such as tempo, beat level or location of onset positions, it serves as a tool that may be used for tasks such as beat tracking, tempo and meter estimation.

Harmonically Wrapped Peak Similarity (HWPS)
Tzanetakis described a set of rhythmic content features calculated with recourse to the Beat Histograms of a song, which proved useful for musical genre classification [64]: A0, A1: relative amplitude of the first (A0) and second (A1) histogram peaks; RA: ratio of the amplitude of the second peak to the amplitude of the first peak; P1, P2: period of the first and second peaks, in BPM; SUM: histogram sum (an indication of beat strength). Subsequently, HWPS, a feature following similar principles, was proposed and integrated into Marsyas to calculate harmonicity by taking "into account spectral information in a global manner" [83].

Metrical Structure
This feature provides a detailed description of the hierarchical metrical structure by detecting periodicities from the onset detection curve and tracking a broad set of metrical levels [65]. This extractor is used to calculate the meter-based tempo estimation in the MIR Toolbox.

Metrical Centroid and Strength
These functions provide two descriptors derived from the above metrical analysis performed in the MIR Toolbox: Dynamic metrical centroid: estimation of the metrical activity, based on the computation of the centroid of the selected metrical level [65]; Dynamic metrical strength: an indicator of the clarity and strength of the pulsation. Estimates whether a "clear and strong pulsation, or even a strong metrical hierarchy is present", or if the opposite is true, where "the pulsation is somewhat hidden, unclear" [65] or a complex mix of pulsations.

Note Duration statistics
Panda et al. propose note duration statistics (the same six ones, as proposed for the melody dimension), based on the duration of each note [5].

Note Duration Distribution
Moreover, note duration distribution features are also proposed in [5]: Short Notes Ratio, Medium Length Notes Ratio and Long Notes Ratio. Similarly, the authors compute the note duration distribution per second, for each of the three duration classes defined.

Ratios of Note Duration Transitions
Finally, Panda et al. also propose ratios of note duration transitions [5], namely, Transitions to Longer Notes Ratio, Transitions to Shorter Notes Ratio and Transitions to Equal Length Notes Ratio.

Rhythmic Fluctuation
This feature (present in the MIR Toolbox) estimates the rhythm content of an audio signal. This estimation is based on spectrogram computation transformed by auditory modeling followed by spectrum estimation in each band [84], i.e., the rhythmic periodicity along auditory channels.

Tempo Change
An indicator of tempo change over time is estimated by computing the difference between successive values of the tempo curve in the MIR Toolbox. This feature is expressed independently from the choice of a metrical level by computing the ratio of tempo values between successive frames and is expressed in logarithmic scale (base 2) [65].

Pulse / Rhythmic Clarity
This feature (implemented in the MIR Toolbox and Essentia) estimates the "rhythmic clarity", an indicator of the clarity and strength of the beats estimated by tempo estimation algorithms. Distinct heuristics exist for this estimation. The most common uses the autocorrelation curve that is computed during tempo estimation [65]. Essentia computes an approximate metric, called beats loudness.

Dynamics Features
In this section we describe the audio features that capture information primarily related with dynamics and its components (Table 12).

Root-Mean-Square (RMS) Energy
The RMS energy (implemented in Marsyas, the MIR Toolbox and Essentia) is used to measure the power of a signal over a window, or global energy. This is usually computed by taking the root-mean-square (RMS) [64]. It roughly describes the loudness of a musical signal.

Low Energy Rate
Low energy rate (available in Marsyas and the MIR Toolbox) measures the percentage of frames with less-than-average energy [64]. This metric estimates the temporal distribution of energy, in order to understand if this energy remains constant between frames or if some frames are more contrastive than others.
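Both descriptors reduce to simple operations on frame energies, as in the following illustrative sketch (librosa-based, with a hypothetical file name; not the Marsyas or MIR Toolbox code):

import numpy as np
import librosa

y, sr = librosa.load("song.wav")
rms = librosa.feature.rms(y=y)[0]                # one RMS value per frame
low_energy_rate = np.mean(rms < rms.mean())      # fraction of quieter-than-average frames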

Sound Level
This descriptor (present in PsySound3) corresponds to the power sum of the spectrum for each time window, expressed in decibels. At a higher level, when appropriately calibrated, this represents the unweighted sound pressure level of the signal in each analysis window [66].

Instantaneous Level, Frequency and Phase
These features (implemented in PsySound3) consist in applying a Hilbert transform to the audio waveform, resulting in three different outputs: the instantaneous level, instantaneous frequency and instantaneous phase. The instantaneous level can be regarded as the sound pressure level derived from the Hilbert transform [66].
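A minimal sketch of these descriptors, computed with scipy's Hilbert transform (our own illustrative code, ignoring PsySound3's calibration steps), is given below:

import numpy as np
from scipy.signal import hilbert

def instantaneous_descriptors(x, sr):
    """Instantaneous level, phase and frequency from the analytic signal."""
    analytic = hilbert(x)
    level = np.abs(analytic)                    # instantaneous envelope
    phase = np.unwrap(np.angle(analytic))       # instantaneous phase (radians)
    freq = np.diff(phase) / (2.0 * np.pi) * sr  # instantaneous frequency (Hz)
    return level, phase, freq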

Loudness
Sound loudness is the subjective perception of the intensity of a sound. This metric is measured in sones, where a doubling in sones corresponds to a doubling of loudness [66]. Several loudness metrics have been proposed over the years, which are available in PsySound3 and Essentia.

Timbral Width
Timbral width (PsySound3) is one of six measures of timbre proposed by Malloch in a method called loudness distribution analysis [85]. Timbral width can be regarded as "a measure of the fraction of loudness that lies outside of the loudest band, relative to the total loudness" [85].

Volume
Volume refers roughly to the perceived "size" of the sound, or the auditory volume of pure tones. This concept was first studied by Stevens [86] and, later on, Cabrera [87] developed a computational volume model for arbitrary spectra, which was integrated into PsySound3. In his work, Cabrera proposes two diotic volume models. The first uses a weighted ratio between the binaural loudness and sharpness, which is the specific loudness centroid on the Bark scale. A second and better performing model uses a simpler centroid to overcome limitations in the method of sharpness calculation selected by the authors [87].

Sound Balance
Sound balance can be estimated through the Maximum (or Minimum) Amplitude Position to Total Envelope Length Ratio (MaxToTotal and MinToTotal), provided in the MIR Toolbox and Essentia. This metric indicates how far off-center the maximum amplitude (peak) of a sound envelope is. To this end, the ratio between the index of the maximum (or minimum) value of the envelope of a signal and the total length of the envelope is computed. If the peak amplitude is found close to the beginning (e.g., decrescendo sounds), this ratio will be close to 0. A value of 0.5 means that the peak is close to the middle, and the ratio is near 1 if the peak is at the end of the sound (e.g., crescendo sounds) [69].

Note Intensity statistics
Panda et al. compute the usual 6 statistics based on the median pitch salience of each note [5].

Note Intensity Distribution
In addition, Panda et al. [5] propose note intensity distribution features. This class of features indicates how the notes of the predominant melody are distributed across three intensity ranges, leading to the following features: Low Intensity Notes Ratio, Medium Intensity Notes Ratio and High Intensity Notes Ratio. The same features are also computed per second.

Ratios of Note Intensity Transitions
Panda et al. [5] also propose ratios of note intensity transitions: Transitions to Higher Intensity Notes Ratio, Transitions to Lower Intensity Notes Ratio and Transitions to Equal Intensity Notes Ratio.

Crescendo and Decrescendo (CD) metrics
Panda et al. identify notes as having crescendo or decrescendo based on the intensity difference between the first and the second half of the note [5]. From these, the authors compute the number of crescendo and decrescendo notes (per note and per second). In addition, they identify sequences of notes with increasing or decreasing intensity, computing the number of sequences for both cases (per note and per second), as well as the length of crescendo sequences in notes and in seconds, using the six usual statistics.

Tone Color Features
In this section we describe the audio features that capture information related with tone color (timbre) and its components (Table 13).

Attack/Decay Time
One of the aspects influencing tone color is the sound envelope, which can be divided into four parts: attack, decay, sustain and release. Several descriptors can be extracted from it, mostly related with the attack phase, i.e., from the starting point of the envelope until the amplitude peak is attained. One of these descriptors is the attack time (present in the MIR Toolbox and Essentia), which consists in the estimation of temporal duration of the various attack phases in an audio signal [76]. The MIR Toolbox is also able to compute the decay time.

Attack/Decay Slope
The attack slope (available in the MIR Toolbox) is another descriptor extracted from the attack phase [76]. It consists in the estimation of the average slope of the entire attack phase, from its start to the peak. The MIR Toolbox is also able to extract the same information from the decay phase, related to its decrease slope [65].

Attack/Decay Leap
The attack leap is a simple descriptor related to the attack phase. In the MIR Toolbox, it consists in the estimation of the amplitude difference between the beginning (bottom) and the end (peak) of the attack phase [65]. As with the previous features, the MIR Toolbox outputs a similar descriptor related with the decay phase.

Zero Crossing Rate (ZCR)
The Zero Crossing Rate (Marsyas, MIR Toolbox, Essentia) represents the number of times the waveform changes sign in a window (crosses the x-axis). It can be used as a simple indicator of change of frequency or noisiness. As an example, heavy metal music, due to guitar distortion and heavy percussion, will tend to have much higher zero crossing values than classical music [64]. Sometimes the ZCR derivative is also computed, representing the absolute value of the window-to-window change in zero crossing rate.
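As an illustration, a frame-level ZCR can be computed directly as follows (a simplified sketch that ignores the treatment of samples that are exactly zero):

import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs in the frame with differing signs."""
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])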

Spectral Flatness
The spectral flatness (Marsyas, MIR Toolbox, Essentia) indicates whether the spectrum distribution is smooth or spiky, i.e., estimates to which degree the frequencies in a spectrum are uniformly distributed (noise-like) [65]. It is usually computed as the ratio between the geometric mean and the arithmetic mean [76]. Marsyas adopts a different approach, proposed in [88], calculating the spectral flatness in different spectral bands.
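The ratio itself is straightforward to compute from a power spectrum, as in the following illustrative sketch (conventions such as power versus magnitude weighting differ between frameworks):

import numpy as np

def spectral_flatness(magnitude_spectrum, eps=1e-10):
    """Geometric-to-arithmetic mean ratio of the power spectrum: close to 1
    for noise-like (flat) spectra, close to 0 for tonal, peaky spectra."""
    power = np.asarray(magnitude_spectrum, dtype=float) ** 2 + eps
    geometric_mean = np.exp(np.mean(np.log(power)))
    return geometric_mean / np.mean(power)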

Spectral Crest Factor (SCF)
The spectral crest factor [88] is a measure of the "peakiness" of a spectrum and is inversely proportional to the spectral flatness measure. It is commonly used to distinguish noise-like from tone-like sounds due to their different spectral shapes, where noise-like sounds have lower spectral crests. In Marsyas, the SCF is computed as the ratio of the maximum and mean spectrum powers of a subband.

Irregularity
Irregularity, also known as spectral peaks variability, is the degree of variation of the amplitude of successive spectral peaks [65]. This feature is present in the MIR Toolbox.

Tristimulus
The tristimulus feature [76], implemented in Essentia, quantifies the relative energy of partial tones using three parameters, measuring the energy ratio of the first partial (tristimulus1), of the second, third and fourth partials (tristimulus2), and of the remaining partials (tristimulus3).

Odd-to-Even Harmonic Energy Ratio
The odd-to-even harmonic energy ratio (Essentia) "distinguishes sounds with predominant energy at odd harmonics (such as clarinet sounds) from other sounds with smoother spectral envelopes (such as the trumpet)" [76].

Spectral Moments
The four spectral moments (implemented in the MIR Toolbox, PsySound and Essentia) are useful measures of spectral shape [76]. The spectral centroid (also available in Marsyas) is the first moment (mean) of the magnitude spectrum of the short-time Fourier Transform (STFT). The spectral spread represents the standard deviation of the magnitude spectrum. Thus, it is a measure of the dispersion or spread of the spectrum.
Spectral skewness is the third central moment of the magnitude spectrum and it is a measure of its symmetry.
Finally, in simple terms, spectral kurtosis, or the fourth central moment of the magnitude spectrum, captures information about existing outliers.
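For illustration, the four moments can be computed by treating the normalized magnitude spectrum as a probability distribution over frequency (a simplified sketch; framework implementations differ in details such as power versus magnitude weighting):

import numpy as np

def spectral_moments(magnitude, freqs, eps=1e-10):
    """Centroid, spread, skewness and kurtosis of a magnitude spectrum."""
    p = magnitude / (np.sum(magnitude) + eps)
    centroid = np.sum(freqs * p)
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
    skewness = np.sum(((freqs - centroid) ** 3) * p) / (spread ** 3 + eps)
    kurt = np.sum(((freqs - centroid) ** 4) * p) / (spread ** 4 + eps)
    return centroid, spread, skewness, kurt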

Spectral Entropy
The spectral entropy of a signal is a measure of its spectral power distribution, based on Shannon entropy [89] from the information theory field. This feature is implemented in the MIR Toolbox and Essentia.
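A minimal sketch of this computation (Shannon entropy of the normalized power spectrum; normalization conventions differ between frameworks) is shown below:

import numpy as np

def spectral_entropy(magnitude, eps=1e-10):
    """Shannon entropy (in bits) of the normalized power spectrum."""
    power = np.asarray(magnitude, dtype=float) ** 2
    p = power / (power.sum() + eps)
    return -np.sum(p * np.log2(p + eps))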

Spectral Flux
Spectral flux (Marsyas, MIR Toolbox, Essentia) is a measure of the amount of spectral change in a signal, i.e., the distance between the spectra of successive frames [64]. Spectral flux has also been shown by user experiments to be an important perceptual attribute in the characterization of the timbre of musical instruments [90].

Spectral Rolloff
Spectral rolloff (Marsyas, MIR Toolbox, Essentia) is often used as an indicator of the skewness of the frequencies present in a window. According to Tzanetakis [64], the spectral rolloff is defined as the frequency R_t below which 85 percent of the magnitude distribution is concentrated. The percentage varies among authors, but 85 percent is the current default value for most frameworks.
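For illustration, a simplified sketch of the definition above:

import numpy as np

def spectral_rolloff(magnitude, freqs, fraction=0.85):
    """Frequency below which `fraction` of the spectral magnitude is concentrated."""
    cumulative = np.cumsum(magnitude)
    return freqs[np.searchsorted(cumulative, fraction * cumulative[-1])]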

High-Frequency Energy
Several algorithms have been proposed to estimate the high-frequency content in a signal. Brightness (also called high-frequency energy) is one such algorithm, implemented in the MIR Toolbox. It typically consists in fixing a minimum frequency value and measuring the amount of energy above that frequency [65]. The Essentia framework implements a different algorithm, named high-frequency content (HFC), to measure the amount of high-frequency energy in the signal spectrum. HFC is computed by applying one of several algorithms, e.g., [91].

Cepstrum (Real/Complex)
The cepstrum is the result of taking the inverse Fourier transform of the logarithm of the estimated spectrum of a signal [92]. It can be regarded as a measure of the rate of change in the different spectral bands. Cepstral analysis has applications in fields such as pitch analysis, echo detection and human speech processing, by providing a simple way to separate formants (due to filtering in the vocal tract) from the vocal source [93]. Cepstral analyzers are available in PsySound3.

Energy in Mel/Bark/ERB Bands
In audio signal processing, it is often important to decompose the original signal into a series of audio signals of different frequencies (i.e., low to high-frequency channels), enabling the study of each channel separately. This is inspired by the human cochlea, which can be regarded as a filter bank, distributing the frequencies into critical bands. Several scales have been proposed, each one using a particular range of frequencies, e.g., the Mel, Bark or Equivalent rectangular bandwidth (ERB) scales [94]. The energy in the Mel/Bark bands is computed in the MIR Toolbox and in Essentia. The energy in the ERB bands is computed in the same two frameworks, as well as PsySound3.

Mel-Frequency Cepstral Coefficients (MFCC)
MFCCs [95] are another measure of spectral shape. The frequency bands are positioned logarithmically on the Mel scale and cepstral coefficients are then computed based on the Discrete Cosine Transform of the log magnitude spectrum. Typically, only the first 13 cepstral coefficients are returned by audio frameworks. These 13 coefficients are mostly used for speech representation, but Tzanetakis states that "the first five coefficients are adequate for music representation" [64]. This descriptor is provided by Marsyas, the MIR Toolbox and Essentia.
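For illustration, the first 13 MFCCs per frame can be extracted with librosa (hypothetical file name; librosa is not one of the four surveyed frameworks):

import librosa

y, sr = librosa.load("song.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)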

Linear Predictive Coding Coefficients (LPCC)
Linear predictive coding is used in speech research to represent the spectral envelope of a digital speech signal in compressed form, using to this end information of a linear predictive model [96]. LPCCs, available in Marsyas and Essentia, represent the cepstral coefficients derived from linear prediction and have been used in a wide range of speech applications, such as speech analysis, encoding and speech emotion recognition [96].

Linear Spectral Pairs (LSP)
Linear Spectral Pairs (available in Marsyas) are an alternative representation of linear prediction coefficients (LPC) for transmission over a channel. LSPs have several properties (e.g., smaller sensitivity to quantization noise) that make them superior to direct quantization of LPCs. Thus, LSPs are useful in speech recognition and coding [97].

Spectral Contrast
The octave-based spectral contrast is a feature proposed by Jiang et al. [98] to represent the spectral characteristics of an audio signal, specifically the relative spectral distribution. According to the authors, the feature has been tested in music type classification problems, demonstrating a "better discrimination among different music types than mel-frequency cepstral coefficients (MFCC)" [98], which is one of the features typically used in such problems. It is implemented in Essentia.

Roughness (Sensory Dissonance)
Sensory dissonance, also known as roughness, is related to the beating phenomenon that occurs whenever a pair of sinusoids are close in frequency [99]. This feature is implemented in Marsyas, the MIR Toolbox and Essentia using different algorithms, e.g., the method by Sethares, which estimates the total roughness by averaging the dissonance estimates across all possible peak pairs of the spectrum [100].

Spectral and Tonal Dissonance
PsySound3 computes spectral and tonal dissonance features. Dissonance measures the harshness or roughness of the acoustic spectrum [66]. Dissonance generally implies a combination of notes that sound harsh or unpleasant to people when played at the same time. PsySound3 provides two descriptions of acoustic dissonance: "spectral dissonance", which uses all Fourier components, and "tonal dissonance", which uses a peak extraction algorithm before calculating dissonance.

Expressivity Features
In this section we describe the audio features that capture information primarily related with expressiveness. As will be observed, we are only aware of one feature of this type in the analyzed audio frameworks. Hence, we have recently proposed a set of novel features targeting expressivity [5]. Table 14 summarizes the available expressivity features.

Average Silence Ratio (ASR)
Average Silence Ratio is a feature proposed by Feng et al. as an estimation for articulation [3]. It is defined as the ratio of silence frames in one-second time windows. According to the authors, "lower ASR means fewer silence frames present in musical piece, or legato in articulation, and the higher ASR means more silence frames present in musical piece, or staccato in articulation". This feature is implemented in the MIR Toolbox.
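A minimal sketch of this idea is given below; the RMS-based silence threshold and hop size are our own assumptions, not the exact values used by Feng et al. or the MIR Toolbox:

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=None, mono=True)      # placeholder file name
hop = 512
rms = librosa.feature.rms(y=y, hop_length=hop)[0]
silent = rms < 0.1 * rms.mean()                           # assumed silence threshold
frames_per_second = int(round(sr / hop))
ratios = [silent[i:i + frames_per_second].mean()          # silence ratio per 1-s window
          for i in range(0, len(silent), frames_per_second)]
average_silence_ratio = float(np.mean(ratios))
```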

Articulation Metrics
Articulation is a technique affecting the transition or continuity between notes or sounds. Panda et al. [5] proposed an approach to detect legato (i.e., connected notes played "smoothly") and staccato (i.e., short and detached notes). Based on their algorithm, all the transitions between notes in the song clip are classified and, from them, several metrics are extracted, such as the ratios of staccato, legato and other transitions, and the longest sequence of each articulation type.

Glissando Metrics
Glissando is another kind of expressive articulation, which consists in the glide from one note to another. It is used as an ornamentation, to add interest to a piece and thus may be related to specific emotions in music. Panda et al. [5] proposed a glissando detection algorithm based on which several glissando features are extracted, e.g., glissando presence, extent, duration, direction, slope and glissando to non-glissando ratio (i.e., the ratio of notes containing glissando to the total number of notes).

Portamento Metrics
Computational models of portamento, the smooth and monotonic increase or decrease in pitch from one note to the next, were proposed in [101], using Hidden Markov Models on the vibrato-free (flattened) pitch curve.

Vibrato Metrics
Vibrato is an expressive technique used in vocal and instrumental music that consists in a regular oscillation of pitch. Its main characteristics are the amount of pitch variation (extent) and the velocity (rate) of this pitch variation. Panda et al. [5] proposed a vibrato detection algorithm based on the analysis of the F0 sequence of each note, from which several features are extracted, e.g., vibrato presence, rate, extent, coverage, high-frequency coverage, vibrato to non-vibrato ratio and vibrato notes base frequency. Other approaches to extract vibrato parameters have been proposed, e.g., using filter diagonalization methods [101] or working directly on the spectrogram with predefined vibrato templates [102].
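To make the idea concrete, a very simplified sketch of estimating vibrato rate and extent from a single note's F0 contour is shown below; it is not the algorithm of [5], merely a spectral-peak approximation that assumes the contour has already been segmented per note:

```python
import numpy as np

def vibrato_rate_extent(f0, frame_rate):
    """Rough vibrato rate (Hz) and extent from one note's F0 contour (sketch)."""
    f0 = np.asarray(f0, dtype=float)
    detrended = f0 - f0.mean()                        # remove the note's base frequency
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(len(detrended), d=1.0 / frame_rate)
    peak = spectrum[1:].argmax() + 1                  # strongest oscillation, skipping DC
    rate = freqs[peak]                                # oscillations per second
    extent = (f0.max() - f0.min()) / 2.0              # half the peak-to-peak deviation
    return rate, extent
```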

Tremolo Metrics
Tremolo is a trembling effect, somewhat similar to vibrato but regarding change of amplitude. Although in the survey presented in Section 3 we did not find any relations between tremolo and emotion, we decided to extract a number of tremolo metrics, based on a tremolo detection algorithm similar to our vibrato detection approach [5]. In this case, the sequence of pitch saliences of each note is used instead of the F0 sequence, since tremolo represents a variation in the intensity or amplitude of the note. Several tremolo features are extracted, e.g., tremolo presence, rate, extent, coverage, and tremolo to non-tremolo ratio.

Texture Features
In this section we describe the audio features that capture information primarily related with musical texture. To the best of our knowledge, none of the features studied or found in the analyzed audio frameworks are primarily related with musical texture. As such, we have recently proposed a set of novel musical texture features in [5], where the sequence of multiple frequency estimates was employed to measure the number of simultaneous layers in each frame of the entire audio signal, leading to the features summarized in Table 15 and described below.

Musical Layers Statistics
Panda et al. proposed musical layer statistics [5]. There, the number of multiple F0s is estimated from each frame of the song clip. The number of layers in a frame is defined as the number of multiple F0s obtained in that frame. Then, the usual six statistics regarding the distribution of the number of musical layers across frames are computed.

Musical Layers Distribution
Additionally, in [5] the number of F0 estimates in a given frame is divided into four classes: i) no layers; ii) a single layer; iii) two simultaneous layers; and iv) three or more layers. The percentage of frames in each class is computed.

Ratio of Musical Layers Transitions
Panda et al. [5] proposed these features to capture information about the changes from a specific musical layer sequence to another. They employ the number of different fundamental frequencies in each frame, identifying consecutive frames with distinct values as transitions and normalizing the total value by the length of the audio segment (in secs). In addition, they also compute the length in seconds of the longest segment for each musical layer.
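A compact sketch combining the three previous feature groups is given below; it assumes that some multi-F0 estimator has already produced the number of simultaneous F0s per frame (layer_counts), which is itself the hard part of the problem. The returned dictionary of statistics is a simplification of the six statistics used in [5].

```python
import numpy as np

def musical_layers_features(layer_counts, frame_rate):
    """Sketch of layer statistics, class distribution and transition ratio (after [5])."""
    counts = np.asarray(layer_counts, dtype=float)
    stats = {"mean": counts.mean(), "std": counts.std(),
             "min": counts.min(), "max": counts.max()}
    # Distribution over four classes: 0, 1, 2, and 3-or-more simultaneous layers.
    distribution = {c: float(np.mean(np.minimum(counts, 3) == c)) for c in range(4)}
    # Transitions: consecutive frames with different layer counts, normalized per second.
    transitions = np.count_nonzero(np.diff(counts))
    transition_ratio = transitions / (len(counts) / frame_rate)
    return stats, distribution, transition_ratio
```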

Form Features
In this section we describe the audio features that capture information primarily related with musical form. Extracting musical form and structure information directly from the audio signal is more difficult when compared to other lower level features (e.g., spectral/timbral statistics). Thus, few computational extractors are available today, as presented in Table 16 and described below.

Structural Change
The amount of change of various underlying basis features at different time intervals, combined into a meta-feature, correlates with the human perception of complexity in music [103]. The typical implementation uses chroma, rhythm and timbre information and exclusively aims at discovering the quantity of change, illustrating it with a visual audio flower plot [103].

Similarity Matrix
Some approaches estimate musical structure based on the similarity between adjacent segments or frames [65]. These similarities are often represented using an inter-frame or inter-segment similarity matrix, showing the differences between all possible pairs of frames from the input audio signal. The similarity matrix computation uses a specific set of frame statistics (e.g., spectral features) and a distance function, to calculate the proximity between each pair of frames. As an example, the MIR Toolbox can use MFCCs, key strength, tonal centroid, chromagram and others with one of several distance functions.
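As an illustration (not the MIR Toolbox implementation), a frame similarity matrix can be sketched from MFCCs and a cosine distance; the file name and feature choice are placeholders:

```python
import librosa
from scipy.spatial.distance import cdist

y, sr = librosa.load("song.wav", sr=None, mono=True)     # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # frame statistics: (13, n_frames)
distance = cdist(mfcc.T, mfcc.T, metric="cosine")        # all pairwise frame distances
similarity = 1.0 - distance                              # higher values = more similar frames
```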

Novelty Curve
Based on the specific musical characteristics of each segment or frame, obtained for instance with a similarity matrix, a novelty curve can be obtained by comparing the successive frames to estimate temporal changes in the song [65]. In this novelty curve, implemented in the MIR Toolbox, the probability of transitioning to a different state over time is represented by the curve peaks.
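One common way to obtain such a curve (a Foote-style checkerboard kernel correlated along the main diagonal of the similarity matrix) is sketched below; the kernel size is an arbitrary choice and this is not the MIR Toolbox code. Peaks of the returned curve can then be picked as candidate segment boundaries.

```python
import numpy as np

def novelty_curve(similarity, kernel_size=32):
    """Foote-style novelty from a frame similarity matrix (sketch)."""
    half = kernel_size // 2
    sign = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((half, half)))
    taper = np.outer(np.hanning(kernel_size), np.hanning(kernel_size))
    kernel = sign * taper                             # tapered checkerboard kernel
    n = similarity.shape[0]
    novelty = np.zeros(n)
    for i in range(half, n - half):                   # slide along the main diagonal
        window = similarity[i - half:i + half, i - half:i + half]
        novelty[i] = np.sum(window * kernel)          # peaks indicate structural changes
    return novelty
```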

Higher-Level (HL) Form Analysis
Modeling the fundamental aspects of musical sections in a unified way, in order to identify song elements such as intro, bridge or chorus, is still an open problem. Some of the most promising approaches apply higher-level solutions combining low-level features, statistics and machine learning. These include hierarchical semi-Markov models [104], convex non-negative matrix factorization, spectral clustering [105] and deep learning [106].

Vocal Features
A few works have studied emotion in speaking and singing voice [107], as well as the related acoustic features [108]. In fact, "using singing voices alone may be effective for separating the "calm" from the "sad" emotion, but this effectiveness is lost when the voices are mixed with accompanying music" and "source separation can effectively improve the performance" [15].
To this end, Panda et al. [5] applied the singing voice separation approach proposed by Fan et al. [109] (although separating the singing voice from accompaniment in an audio signal is still an open problem) and the Voice Analysis Toolkit, a "set of Matlab code for carrying out glottal source and voice quality analysis" 3 to extract the features summarized in Table 17 and described below.

All Features From the Vocals Channel
Besides extracting features from the original audio signal, Panda et al. [5] also extracted the previously described features from the signal containing only the separated voice.

Voiced and Unvoiced Statistics
In [5], the authors also proposed statistics related to the amount of voiced and unvoiced sections in a song. These include, among others, the number of voice segments; the mean, maximum, minimum, standard deviation, kurtosis and skewness of the duration of voice segments; as well as the number of voice segments per second.

Creaky Voice Statistics
Panda et al. [5] computed statistics related with the presence of creaky voice, "a phonation type involving a low frequency and often highly irregular vocal fold vibration, [which] has the potential [. . .] to indicate emotion" [110].

High-Level Features
Finally, frameworks such as the MIR Toolbox and Essentia provide a few experimental higher-level features, related with complex concepts such as emotion, genre or danceability. Most, if not all, of these are predictors, combining classification algorithms and previously gathered data to label the source audio files into a fixed set of tags. A summary of these predictors is presented in Table 18 and listed below.

Emotion
The MIR Toolbox extracts an emotion descriptor based on the analysis of the audio content of a given recording. The output is given in two distinct paradigms: a categorical approach comprising five emotions and a three-dimensional space composed of activity (energetic arousal), valence (pleasure-displeasure continuum) and tension (tense arousal).
The classification process is based on the work by Eerola et al. [111] and uses multiple linear regression with the 5 best performing predictors. Given its reliance on previously established weights, this extractor is only reliable in the MIR Toolbox version (v1.3) where it was initially "calibrated". Newer versions output "distorted results" [65].
The Essentia library implements a similar feature, classifying songs in 4 distinct emotions. It contains pre-trained models and requires the Gaia library to apply similarity measures and classifications on the extracted features [67].

Classification-Based Features (Genre, etc.)
In a similar way to the emotion descriptor extractor (or predictor), Essentia also includes Gaia trained models for [67]: musical genre (using 4 different databases), ballroom music classification, western/non-western music, tonal/atonal, danceability, voice/instrumental, gender (male/female singer), and timbre classification. These musical descriptors work as a typical classification problem, by extracting a set of features from the source audio signal and feeding them to classification models previously trained on other datasets.
The genre feature is particularly relevant for music emotion recognition since some emotions are frequently associated with specific genres, as concluded by Laurier [4]. The author used automatic genre classification to improve his previous emotion classification results.

Danceability
As opposed to the aforementioned danceability extractor, built as a pre-trained classification model, Streich proposed a low-level audio feature derived from Detrended Fluctuation Analysis (DFA) to characterize audio signals in terms of their danceability [112].

Dynamic Complexity
Streich also studied the automated estimation of the complexity of music based on the musical audio signal, proposing a set of complexity descriptors [112]. The proposed algorithms focus on aspects of acoustics, rhythm, timbre, and tonality. The Essentia library implements an extractor to estimate dynamic complexity, i.e., whether a song has a wide dynamic range. This descriptor consists in the average absolute deviation from the global loudness level estimate on the dB scale.
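A rough sketch of this descriptor, using frame RMS as a stand-in for the loudness estimate actually used by Essentia, could be:

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=None, mono=True)        # placeholder file name
rms = librosa.feature.rms(y=y)[0]
loudness_db = 20.0 * np.log10(rms + 1e-10)                  # frame loudness on the dB scale
dynamic_complexity = float(np.mean(np.abs(loudness_db - loudness_db.mean())))
```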

Feature Analysis Along Musical Dimensions
Table 19 presents the number of described features per musical dimension. As abovementioned, many of these features are frame-level features, which are normally integrated using statistical moments. This increases the final number of descriptors to several hundred [5] and is especially true for tone color features, where some features divide the audio signal into bands and output time-series data (e.g., MFCCs). As such, and based on the figures in Table 19, we conclude that the number of available audio features is very unbalanced across musical dimensions. Musical texture, expressivity and form are especially lacking, in contrast to tone color, which is the most represented category, mostly due to the large set of spectral features available (centroid, etc.). In [5], we have contributed to reducing that imbalance by proposing emotionally-relevant features, particularly for the expressivity and texture dimensions.
The low number of texture, form and expressivity features is not a surprise. We believe this is caused by two main reasons: i) on the one hand, the difficulty to create robust algorithms to capture such music elements; ii) on the other hand, the lack of music psychology studies on the relations between emotion and those dimensions, which could drive the creation of computational models.
Regarding the analysis of the importance of specific features to emotion recognition, few studies have addressed this issue in a systematic way, e.g., [5]. There, the conducted analysis, based on Russell's emotion quadrants [28], suggested that tone color features (particularly spectral features) dominated all quadrants, possibly due to their prevalence (as discussed above). Nevertheless, texture features were in the top 5 for quadrant 2 (anxiety quadrant, or Q2) and proved relevant for Q1 (happiness) as well, helping to improve the classification performance of the proposed algorithm. Vibrato was also an important feature for Q2. As for Q3 (depression), besides tonal features, texture, inharmonicity and tremolo also proved relevant, along with vocal features. Finally, dynamics, texture and expressivity features (namely, vibrato) were important to discriminate Q4 (contentment).
Besides the lack of texture, form and expressivity features, "more features are needed to better discriminate Q3 from Q4, which musically share some common characteristics such as lower tempo, less musical layers and energy, use of glissandos and other expressive techniques" [5]. Thus, in the next section we discuss research directions to advance the state-of-the-art in the creation of novel emotionally-relevant features for each musical dimension.

Form
Regarding computational models of form complexity, we are only aware of one work, which might work as a surrogate of musical complexity [103]. Higher-level features to capture form types from audio are still missing, and some recent works have been tackling the problem with higher-level solutions, e.g., employing machine learning to identify elements such as verse and chorus [104], [105], [106].
The impact of other elements of form on emotion, e.g., organizational levels (passage, piece, cycle) or song elements (introduction, chorus, bridges, etc.), should be further researched by the music psychology community, despite a few computational models found in the literature that might partially capture such information (e.g., similarity matrix and novelty curve).

Texture
The texture dimension, as abovementioned, requires further music psychology studies to better understand how it influences emotion. Nevertheless, the features we proposed in [5] proved relevant, namely the number of musical layers in the recognition of happy music.
These features only approximate the actual number of layers in a song, hence more advanced computational models are needed, probably requiring robust source separation and instrument recognition in polyphonic music. This is an active research problem (e.g., [113]), with great advances in recent years due to the application of deep learning models, as is the case of the Spleeter library, which is able to perform various types of separation (e.g., vocals, accompaniment, drums and bass) [114].
Tackling this problem would also serve the creation of algorithms for the detection of texture types (monophonic, homophonic, polyphonic) and density (thin, thick), for which no computational models are known (see Table 15).

Expressivity
Regarding expressivity, the music psychology community has offered important inputs to understand its impact on emotion. Yet, despite our contributions with several articulation (staccato and legato), glissando, vibrato and tremolo metrics, this dimension still lacks computational models, particularly for the detection of ornamentations other than glissando and portamento (see Tables 6 and 14). Also, the algorithms we proposed were only indirectly evaluated through their impact on emotion classification, and so ground truth data on those problems is needed.

Melody
As for the other musical dimensions, music psychology researchers have provided a great amount of knowledge that could be further exploited to create computational models that capture such musical elements.
Starting with melody, most melodic elements are reasonably covered, as summarized in Table 9. However, features for melodic intervals are still missing. Moreover, further computational features related to melodic movement, direction and contour should be developed. As with many other problems in Music Information Retrieval, problems such as full (polyphonic) transcription or melody transcription are still open, which limits the accuracy of current MER systems that rely on them. This also applies to computational models of the dimensions discussed below (e.g., tonality and rhythm).

Rhythm
Regarding rhythm, although most rhythmic elements are reasonably covered, this dimension lacks computational features that capture rest characteristics (Table 11). In addition, higher-level audio features that capture the types of rhythm (regular, irregular, complex, fluent, etc.) are still missing (see Tables 3 and 11).

Dynamics
As for dynamics, all elements have associated features (Table 12). Still, computational models to detect the types of dynamic levels (forte, piano, etc.) would be beneficial.

Tone Color
The tone color dimension is also reasonably well covered, particularly regarding spectral characteristics (see Table 13). Still, as with musical texture, tone color would also benefit from accurate instrument recognition in a polyphonic context. Moreover, this dimension would also benefit from higher-level features on the types of amplitude envelope (e.g., round, sharp).

Vocal Features
As for vocal features, with the recent advances in areas such as source separation, as previously described, new paths should be explored. For instance, additional features that proved useful for speech emotion should be taken into consideration [16]. Moreover, the idea can be extended, e.g., by further separating the accompaniment and analyzing each layer in isolation, since they may sometimes carry different emotional information [15]. This can be complemented with genre or even lyrical information (natural language processing) and integrated with a meta-classifier.

Deep Learning Perspectives
Finally, besides the classical handcrafted feature engineering approach, deep learning/feature learning techniques have attracted great attention in recent years. The most notable example is the resurgence of neural network techniques, specifically deep learning, applied to a myriad of problems, fueled by improvements in computer processing (e.g., using graphics processing units). Several MER studies have already employed techniques such as convolutional and recurrent neural networks [10].
Despite (so far) slight improvements in classification accuracy, such approaches raise several points that must be considered. First, to fully exploit the potential of deep learning solutions, massive amounts of good quality data are required. Unfortunately, the creation of large MER datasets has been known to be problematic due to the associated subjectivity and complexity of data collection [5]. Hence, strategies to obtain sizeable and good quality data for audio MER are a key need.
Also, deep learning models are opaque in the sense that the extracted features are often difficult to interpret, which hinders the possibility to acquire novel knowledge regarding the relations between emotions and the extracted features. In fact, "although deep neural networks have exhibited superior performance in various tasks, interpretability is always [the] Achilles' heel" of such approaches, despite a few efforts to address it, as surveyed in [115]. Hence, interpretability issues in deep neural networks are another important problem to tackle in the future.

Audio-Based Symbolic Features
As discussed, some approaches bridge the audio and the symbolic MER domains by integrating an audio transcription step into the feature extraction stage. Hence, the approach followed in [5] can be further exploited by integrating symbolic (MIDI) features available in several frameworks, e.g., the MIDI Toolbox or jSymbolic [2].

CONCLUSION
This article offered a comprehensive review of the current emotionally-relevant computational audio features. Unlike previous broad MER surveys, it offered a deep and specific review of that key MER problem. This survey was supported by the music psychology literature on the relations between eight musical dimensions (melody, harmony, rhythm, dynamics, tone color, expressivity, texture and form) and specific emotions. From this review, we concluded that computational audio features able to capture elements of musical form, texture and expressivity are especially needed to break the current glass ceiling in MER, as shown in [5]. Moreover, the development of such computational tools would benefit from further music psychology studies, particularly regarding the actual impact of musical form and texture on emotion. We believe this article opens several research lines to expand the state-of-the-art on Music Emotion Recognition.

Renato Panda received the bachelor's, master's, and PhD degrees from the University of Coimbra. He is currently an invited professor with the Polytechnic Institute of Tomar. He is a member of the Cognitive and Media Systems (CMS) group at the Center for Informatics and Systems of the University of Coimbra (CISUC). His main research interests include music emotion recognition (MER) and music information retrieval (MIR).

Ricardo Malheiro received the bachelor's degree in mathematics, the master's degree in informatics engineering (Licenciatura, 5 years), and the PhD degree, all from the University of Coimbra. He is currently a professor with the Miguel Torga Higher Institute, Coimbra. He is also a member of the CMS research group at CISUC. His main research interests include natural language processing, detection of emotions in music lyrics and text, and text/data mining.
Rui Pedro Paiva received the bachelor's degree in 1996, the master's degree in 1999, and the doctoral degree in 2007, all from the Department of Informatics Engineering of the University of Coimbra. He is currently a professor with the Department of Informatics Engineering of the University of Coimbra. He is also a member of the CMS group at CISUC. His main research interests include the areas of MIR and health informatics. The authors' common research interests include the application of feature engineering, machine learning and signal processing to the analysis of musical and bio signals.