The Echo Nest analysis API takes an mp3 file and returns an XML file (see Figure) containing the structural and perceptual description (i.e. segmentation, pitch, timbre, and rhythm data) of the audio signal in the mp3. The analysis XML content is described below.

Some of the data in the XML file for Air’s “Kelly Watch The Stars”

XML data description


<?xml version="1.0" encoding="UTF-8"?>

<!--The Echo Nest Metadata v22. No warranty or copyright implied on the original signal. http://www.echonest.com-->

<Analysis>

    <!-- Analysis: root element of this file tree. The tree contains an acoustic description (structure, perceptual features) of the mp3 file, where the mp3 data md5 corresponds to the name of this XML file. -->


    <Decoder name="FFmpeg" version="FFmpeg: SVN-r9800, libavutil: 49.4.1, libavcodec: 51.40.4, libavformat: 51.12.1">

        <!-- Decoder: contains decoder-specific information. Because audio decoders and their versions behave slightly differently from one another, your own mp3 decoder may differ from the one used by this API. You may find small offsets in timing (typically under 100 ms) and amplitude (perhaps a few dB) in this file compared with what you perceive with your decoder. If synchronization is important in your application, please offset accordingly. -->

            <!-- name: the audio decoder used by this API. -->

            <!-- version: the version of this decoder. -->

    </Decoder>


    <Track duration="448.41664" usableDuration="441.59710" version="22">

        <!-- Track: contains global estimated information, applicable to the whole track. Note that because local data may vary in the course of the track, this global estimation may not apply well to certain sections of the music. -->

            <!-- duration: full duration of the audio (in seconds). -->

            <!-- usableDuration: duration of the audio after removing fade-in and fade-out sections (in seconds). -->

            <!-- version: current version of this API. -->


        <Tags endOfFadeIn="1.23672" startOfFadeOut="441.59710" sizeTimbre="12" sizePitches="12" numBeats="888" numTatums="1502" numSegments="1312" numSections="34" segmentDurationMean="0.34178" segmentDurationVariance="0.08604" timeLoudnessMaxMean="0.08763" loudnessMaxMean="-14.947" loudnessMaxVariance="35.601" loudnessBeginMean="-21.352" loudnessBeginVariance="46.039" loudnessDynamicsMean="6.405" loudnessDynamicsVariance="17.239" loudness="-9.965" tempo="101.056" tempoConfidence="0.667" beatVariance="0.140" tatum="0.30029" tatumConfidence="0.771" numTatumsPerBeat="2" timeSignature="4" timeSignatureStability="0.378" timbreMean="41.669 -44.786 -12.202 -27.523 24.267 -8.971 -13.261 -1.825 11.106 1.830 -3.888 11.661 " timbreVariance="38.926 3823.009 3442.412 1742.492 1348.359 722.299 1046.211 537.805 508.504 427.159 309.456 494.871 " pitchMean="0.357 0.166 0.415 0.176 0.266 0.212 0.118 0.236 0.094 0.222 0.167 0.189 " pitchVariance="0.133 0.040 0.162 0.049 0.116 0.093 0.040 0.096 0.022 0.087 0.071 0.076 "/>

            <!-- endOfFadeIn: time value (in sec) giving the end of a possible initial fade in section. Equals 0 when insignificant. -->

            <!-- startOfFadeOut: time value (in sec) giving the beginning of a possible final fade out section -->

            <!-- sizeTimbre: size of the timbre vector description as found throughout this file -->

            <!-- sizePitches: size of the pitch vector description as found throughout this file -->

            <!-- numBeats: estimated number of beats in the music -->

            <!-- numTatums: estimated number of tatums in the music. See below under “Tatum” for details on tatums. -->

            <!-- numSegments: number of onsets (and therefore audio segments) found in the music. Segments are described in more detail below. -->

            <!-- numSections: estimated number of large sections (e.g. chorus, verse, bridge, break, guitar solo, etc.). Sections are described further below. -->

            <!-- segmentDurationMean: average segment duration (in sec). Note that segments may vary greatly in duration, typically from 80 to 400 ms. -->

            <!-- segmentDurationVariance: variance of the segment duration (in sec). The smaller the number, the more regular the segment durations. -->

            <!-- timeLoudnessMaxMean: average segment attack duration (in sec). timeLoudnessMax is described further below. -->

            <!-- loudnessMaxMean: average segment maximum loudness (in dB). loudnessMax is described further below. -->

            <!-- loudnessMaxVariance: variance of the segment maximum loudness (in decibels, dB). The larger this number, the higher the overall dynamic range of the music. -->

            <!-- loudnessBeginMean: average segment loudness at start (in dB). loudnessBegin is described further below. -->

            <!-- loudnessBeginVariance: variance of the segment loudness at start (in dB). Roughly related to loudnessMaxVariance. -->

            <!-- loudnessDynamicsMean: overall dynamic range estimation on a segment basis (in dB) -->

            <!-- loudnessDynamicsVariance: segment dynamic range variance (in dB). The higher this value, the less constant the segment dynamics. -->

            <!-- loudness: overall track loudness estimation (in dB) -->

            <!-- tempo: overall track tempo estimation (in beats per minute, BPM). Estimation errors may include doubling or halving the perceived value. Note, however, that humans may also disagree on the correct answer. -->

            <!-- tempoConfidence: a confidence value of how accurate the tempo may be (between 0 and 1) -->

            <!-- beatVariance: a value of how regular the beat is (in seconds) -->

            <!-- tatum: estimated overall tatum duration (in seconds). See below under “Tatum” for details on tatums. -->

            <!-- tatumConfidence: a confidence value of how accurate the tatum may be (between 0 and 1) -->

            <!-- numTatumsPerBeat: number of tatums within a beat -->

            <!-- timeSignature: estimated overall time signature (number of beats per measure). Note that this is a perceptual measure, as opposed to what the composer might have written on the score. The values are as follows: 0=NONE, 1=UNKNOWN (perhaps too many variations), 2=2/4, 3=3/4 (e.g. waltz), 4=4/4 (typical in most pop music), 5=5/4, 6=6/4, 7=7/4, etc. -->

            <!-- timeSignatureStability: a rough estimation of how stable the time signature is throughout the track -->

            <!-- timbreMean: a vector description of the overall timbre mean of segments. See below under “Segment” for details on what timbre numbers represent -->

            <!-- timbreVariance: a vector description of the overall timbre variance of all segments. -->

            <!-- pitchMean: a vector description of the overall pitch content of all segments. See below under “Segment” for details on what pitch numbers represent. -->

            <!-- pitchVariance: a vector description of the overall pitch variance of all segments. -->


        <Tatums>0.00000 0.17714 0.48800 0.80034 1.11555 1.43379 1.75135 2.06859 2.38629 2.70333 3.01812 3.33068 3.64025 3.94717 4.25288 4.55706 4.85940 5.16023 5.46031 5.75920    [ ... ]    446.62841 446.98658 447.34514 447.70533 448.06547

           <!-- Tatums: tatum locations (in seconds). Tatums are typically subdivisions of beats, describing the smallest perceptual metrical unit of the music. -->

       </Tatums>

       

        <Section start="0.00000" duration="26.11315"/>

           <!-- Section: sections are the largest chunks in the track, corresponding to major changes in the music, e.g. chorus, verse, bridge, solo, etc. -->

                <!-- start: start of this section (in seconds) -->

                <!-- duration: duration of this section (in seconds) -->


        <Section start="26.11315" duration="24.16456"/>

        <Section start="50.27772" duration="14.53592"/>

        [ ... ]


       <Segment start="0.00000" duration="0.13569">

             <!-- Segment: short sound entity (e.g. 80-400 ms) that is somewhat uniform in timbre and harmony, whether a solo sound or a mixture of sounds (e.g. a piano note, a guitar chord, a mix of bass, cymbal and voice phoneme, a snare with sax, etc.). A segment is typically defined by the inter-onset duration and has the time envelope of an attack and decay. -->

                <!-- start: start of this segment (in seconds) -->

                <!-- duration: duration of this segment (in seconds) -->

           <Tags timeLoudnessMax="0.12190" loudnessMax="-59.886" loudnessBegin="-60.000" loudnessEnd="-59.887" pitches="0.125 0.181 0.342 0.552 0.676 0.947 1.000 0.683 0.259 0.059 0.069 0.058 " timbreCoeff="0.008 170.952 9.195 -28.731 57.245 -50.104 15.000 5.174 -27.215 1.026 -10.712 -7.169 "/>

                <!-- timeLoudnessMax: attack duration (in seconds) -->

                <!-- loudnessMax: maximum loudness (in dB) -->

                <!-- loudnessBegin: loudness at the onset (in dB) -->

                <!-- loudnessEnd: loudness at the end of the decay (in dB) -->

                <!-- pitches: a 12-number chroma vector representing the harmonic content of the whole segment, as folded into the 12 pitches of the chromatic scale (from C to B). All numbers range from 0 to 1, with the strongest pitch bin always set to 1. As a result, noisy sounds tend to give 12 high values, whereas pitched sounds emphasize the strength of one or a few pitch bins. -->

                <!-- timbreCoeff: a 12-number vector describing the timbre of the segment (i.e. the color of the sound) in an eigenspace. Because timbre is ill-defined, it is difficult to describe precisely what each of these dimensions represents. However, the 12 dimensions aren’t independently normalized; that is, their relevance is directly comparable on a single scale. Dimensions are ordered by importance. You can think of each dimension as describing a particular aspect of the spectral surface of a sound: the 1st dimension is the average loudness of the segment, the 2nd is a rough representation of the weight of low frequencies, the 3rd emphasizes the middle frequencies, the 4th relates mostly to the attacks, etc. Combined, those 12 dimensions give a fairly smooth yet accurate description of the spectral surface of the sound segment, capturing at once both the time and frequency evolution of an auditory (perceptual) spectrogram. The 12 basis functions (i.e. the parts of an audio segment that each dimension best describes) are graphically displayed below:

Copyright © The Echo Nest Corporation, 2005-2008

The timbre of an audio segment is best described by the time-frequency spectral surface of an auditory spectrogram. The X axis represents time and the Y axis represents frequency (on a non-linear perceptual scale). Here timbre is equivalent to the linear sum of the 12 basis functions (as displayed, where the scale goes from black to white), each multiplied by the corresponding vector coefficient: timbre = c1 * f1 + c2 * f2 + ... + c12 * f12. -->

     </Segment>


        <Segment start="0.13569" duration="0.66816">

           <Tags timeLoudnessMax="0.10449" loudnessMax="-22.364" loudnessBegin="-59.887" loudnessEnd="-29.518" pitches="0.055 0.058 0.917 0.122 1.000 0.050 0.025 0.017 0.038 0.304 0.024 0.053 " timbreCoeff="28.499 -21.488 -91.550 -73.336 23.788 120.651 54.148 -59.556 -15.388 99.226 60.531 -1.781 "/>

        </Segment>

        [ ... ]


    </Track>

</Analysis>
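
To illustrate how the structure described above might be read programmatically, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The SAMPLE string is a hand-trimmed excerpt built from values in the listing above, not a complete analysis file. The names parse_analysis, floats, and time_signature_label, as well as the layout of the returned dictionary, are assumptions made for this example and are not part of the API.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed excerpt using values from the listing above (not a full file).
SAMPLE = """
<Analysis>
    <Track duration="448.41664" usableDuration="441.59710" version="22">
        <Tags tempo="101.056" tempoConfidence="0.667" timeSignature="4" numTatumsPerBeat="2"/>
        <Tatums>0.00000 0.17714 0.48800 0.80034</Tatums>
        <Section start="0.00000" duration="26.11315"/>
        <Section start="26.11315" duration="24.16456"/>
        <Segment start="0.00000" duration="0.13569">
            <Tags loudnessMax="-59.886" pitches="0.125 0.181 0.342 0.552 0.676 0.947 1.000 0.683 0.259 0.059 0.069 0.058" timbreCoeff="0.008 170.952 9.195 -28.731 57.245 -50.104 15.000 5.174 -27.215 1.026 -10.712 -7.169"/>
        </Segment>
        <Segment start="0.13569" duration="0.66816">
            <Tags loudnessMax="-22.364" pitches="0.055 0.058 0.917 0.122 1.000 0.050 0.025 0.017 0.038 0.304 0.024 0.053" timbreCoeff="28.499 -21.488 -91.550 -73.336 23.788 120.651 54.148 -59.556 -15.388 99.226 60.531 -1.781"/>
        </Segment>
    </Track>
</Analysis>
"""


def floats(text):
    """Split a whitespace-separated string into a list of floats."""
    return [float(x) for x in (text or "").split()]


def time_signature_label(code):
    """Map the timeSignature code to a label, following the description above."""
    if code == 0:
        return "NONE"
    if code == 1:
        return "UNKNOWN"
    return f"{code}/4"


def parse_analysis(xml_text):
    """Return a plain dictionary with a few track, section, and segment fields."""
    root = ET.fromstring(xml_text.strip())
    track = root.find("Track")
    # find() only searches direct children, so this is the track-level Tags
    # element and not one of the Tags elements nested inside a Segment.
    tags = track.find("Tags")

    segments = []
    for seg in track.findall("Segment"):
        seg_tags = seg.find("Tags")
        segments.append({
            "start": float(seg.get("start")),
            "duration": float(seg.get("duration")),
            "loudness_max": float(seg_tags.get("loudnessMax")),
            "pitches": floats(seg_tags.get("pitches")),
            "timbre": floats(seg_tags.get("timbreCoeff")),
        })

    return {
        "duration": float(track.get("duration")),
        "tempo": float(tags.get("tempo")),
        "time_signature": time_signature_label(int(tags.get("timeSignature"))),
        "tatums": floats(track.findtext("Tatums")),
        "sections": [
            (float(s.get("start")), float(s.get("duration")))
            for s in track.findall("Section")
        ],
        "segments": segments,
    }


if __name__ == "__main__":
    analysis = parse_analysis(SAMPLE)
    print(analysis["tempo"], analysis["time_signature"])
    print(len(analysis["segments"]), "segments")
```

Returning plain dictionaries and lists keeps the sketch free of third-party dependencies. Note that the elision markers ([ ... ]) in the listing above are documentation shorthand and would not parse as numbers, which is why the sample leaves them out.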

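The pitches vector described above can be turned into note names with a few lines of Python. This sketch assumes the 12 bins are in semitone order starting at C, as the phrase "from C to B" suggests. The function name strong_pitch_classes and the 0.5 threshold are arbitrary choices made for illustration, not values defined by the API.

```python
# Assumed bin order: semitones from C up to B.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# The pitches attributes of the two segments shown in the listing above.
SEGMENT_1 = "0.125 0.181 0.342 0.552 0.676 0.947 1.000 0.683 0.259 0.059 0.069 0.058 "
SEGMENT_2 = "0.055 0.058 0.917 0.122 1.000 0.050 0.025 0.017 0.038 0.304 0.024 0.053 "


def strong_pitch_classes(pitches_attr, threshold=0.5):
    """Return the pitch-class names whose strength is at or above the threshold,
    strongest first."""
    values = [float(x) for x in pitches_attr.split()]
    if len(values) != len(PITCH_CLASSES):
        raise ValueError("expected 12 pitch values, got %d" % len(values))
    # Sort bin indices by descending strength; sorted() is stable, so ties
    # keep their original (ascending pitch) order.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    return [PITCH_CLASSES[i] for i in order if values[i] >= threshold]


if __name__ == "__main__":
    print(strong_pitch_classes(SEGMENT_1))
    print(strong_pitch_classes(SEGMENT_2))
```

With these inputs, the second segment yields E and D as its strongest pitch classes, while the near-silent first segment has five bins above the threshold. That is consistent with the remark that noisy sounds spread their energy over many bins.
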
audio analysis for new music applications
