*_transcription.xml -- file containing the user's multimodal input: speech (transcript and n-best speech recognition hypotheses)
and the accompanying gestures. Each utterance is also annotated with the entities referred to in the utterance.
An example:
<curve start="4960" end="5116">
<point>296 439</point>
<point>296 439</point>
<entity text="bedroom">0.015400</entity>
<entity text="lamp_floor">0.963800</entity>
<entity text="table_pc">0.020800</entity>
<transcription>place the lamp on all four feet</transcription>
<phrase rank="0">
<timestamp start="12771369800700" length="7580"/>
<token text="to">
<timestamp start="4110" length="70"/>
<phonemes>t ax</phonemes>
<token text="place">
<timestamp start="5820" length="340"/>
<phonemes>p l ey s</phonemes>
<token text="the">
<timestamp start="6160" length="90"/>
<phonemes>dh ax</phonemes>
<token text="lamp">
<timestamp start="6250" length="340"/>
<phonemes>l ae m p</phonemes>
<token text="on">
<timestamp start="6590" length="160"/>
<phonemes>ao n</phonemes>
<token text="all">
<timestamp start="6750" length="190"/>
<phonemes>ao l</phonemes>
<token text="four">
<timestamp start="6940" length="230"/>
<phonemes>f ao r</phonemes>
<token text="feet">
<timestamp start="7170" length="410"/>
<phonemes>f iy t</phonemes>
<phrase rank="1">
In the above example:
- The user's speech starts at "12771369800700" (system time in ms).
- The user's gesture starts at "4960" (offset from speech start time) and points to a location "<point>296 439</point>" (screen coordinates).
- The candidate objects for the gesture, with their selection probabilities, are given in the <selection/> tag.
- Each <phrase/> tag contains one recognition hypothesis from the Microsoft speech recognizer. Each token in the hypothesis is timestamped (offset from the speech start time, in ms).
- The entity referred to in the user's utterance is given in the <entity_annotation/> tag.
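
The timing conventions above can be illustrated with a short Python sketch. It uses values copied by hand from the example rather than parsing the file, and it assumes that the gesture and token offsets share the same origin (the phrase's start timestamp), as the notes above state. The names Span, nearest_token, and SPEECH_START_MS are hypothetical helpers introduced for illustration only and are not part of the data set or any tool shipped with it.

```python
from dataclasses import dataclass

# System time (ms) taken from <timestamp start="..."> of <phrase rank="0">.
SPEECH_START_MS = 12771369800700


@dataclass
class Span:
    """A labelled time interval, expressed as an offset from speech start (ms)."""
    label: str
    start: int
    length: int

    @property
    def end(self):
        return self.start + self.length

    def absolute(self, origin=SPEECH_START_MS):
        """Return (start, end) in system time, assuming offsets are relative to origin."""
        return origin + self.start, origin + self.end


# Tokens of the rank-0 hypothesis, copied from the example above.
tokens = [
    Span("to", 4110, 70), Span("place", 5820, 340), Span("the", 6160, 90),
    Span("lamp", 6250, 340), Span("on", 6590, 160), Span("all", 6750, 190),
    Span("four", 6940, 230), Span("feet", 7170, 410),
]

# <curve start="4960" end="5116"> stores an end offset, not a length.
gesture = Span("curve", 4960, 5116 - 4960)

# Selection probabilities from the <entity text="..."> elements.
selection = {"bedroom": 0.0154, "lamp_floor": 0.9638, "table_pc": 0.0208}


def nearest_token(gesture, tokens):
    """Return the token whose interval is closest in time to the gesture."""
    def gap(tok):
        # 0 when the intervals overlap, otherwise the distance between them.
        return max(0, tok.start - gesture.end, gesture.start - tok.end)
    return min(tokens, key=gap)


most_likely = max(selection, key=selection.get)
closest = nearest_token(gesture, tokens)
print(most_likely, closest.label, tokens[3].absolute())
```

With the example values, the gesture falls in the pause between "to" and "place", and the highest selection probability belongs to lamp_floor. Whether the token nearest in time is the right one to align with a gesture is a modelling choice; this sketch only shows how the offsets combine.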