This speech-gaze data set was collected in the treasure hunting domain.
<object name="apple">
  <properties>
    <type text="apple#n#1">apple fruit</type>
    <color text="color#n#1">red</color>
  </properties>
</object>
...

In the above example,
apple_1 -> apple
apple_2 -> apple
apple_3 -> apple

In domain_nnjj.xml, objects are defined under their converted names. For example, there are no definitions for the objects "apple_1", "apple_2", or "apple_3". Instead, only one object, "apple", is defined.
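The name conversion can be illustrated with a small Python sketch. It assumes that a converted name is the instance name with a trailing "_<digits>" suffix removed, which matches the apple example but is not confirmed for every object in the data set. The function name converted_name is hypothetical.

```python
import re

# Assumption: instance names end in "_<digits>" and the converted name is
# whatever precedes that suffix. The data set's actual mapping may differ.
_INSTANCE_SUFFIX = re.compile(r"_\d+$")


def converted_name(instance_name: str) -> str:
    """Return the (hypothetical) converted name for an object instance."""
    return _INSTANCE_SUFFIX.sub("", instance_name)


if __name__ == "__main__":
    for name in ("apple_1", "apple_2", "apple_3"):
        print(name, "->", converted_name(name))
```

Names without a numeric suffix pass through unchanged under this assumption, so a lookup in domain_nnjj.xml would use the returned string as the object name.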
...
<record>
  <gaze_fixation start="12868123630593" length="20" pos="114,21">
    <mesh>castle</mesh>
  </gaze_fixation>
  <scene>
    <mesh pos="308,629" rect="469,189,641,410">beerMug</mesh>
    <mesh pos="51,555" rect="529,98,577,323">plate_4</mesh>
    <mesh pos="0,584" rect="552,0,645,157">plate_2</mesh>
    <mesh pos="365,551" rect="435,242,668,486">door_dining</mesh>
  </scene>
</record>
...

In the above example,
0 - pause scene;
1 - resume scene;
2 - move left;
3 - move right;
4 - move up;
5 - move down;
6 - rotate camera;
7 - move object;
8 - pick object;
9 - reset camera.
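When processing the logs, the ten codes above can be kept in a single lookup structure. The Python sketch below is one way to do this. The names SceneEvent and describe are hypothetical, and the sketch makes no assumption about which XML element or attribute carries the code.

```python
from enum import IntEnum


class SceneEvent(IntEnum):
    """Event codes as listed above (class and member names are illustrative)."""
    PAUSE_SCENE = 0
    RESUME_SCENE = 1
    MOVE_LEFT = 2
    MOVE_RIGHT = 3
    MOVE_UP = 4
    MOVE_DOWN = 5
    ROTATE_CAMERA = 6
    MOVE_OBJECT = 7
    PICK_OBJECT = 8
    RESET_CAMERA = 9


def describe(code: int) -> str:
    """Map a numeric code to its description; raises ValueError if unknown."""
    return SceneEvent(code).name.lower().replace("_", " ")


if __name__ == "__main__":
    for code in range(10):
        print(code, "-", describe(code))
```

Using an enumeration rather than a plain dictionary means an unexpected code fails loudly instead of being silently ignored.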
<user_input>
  <speech>
    <transcript>it's a wooden desk with three drawers on the right side</transcript>
    <wavefile>audio\20081010-105510-601.wav</wavefile>
    <phrase start="12868124101485" length="8380" rank="0">
      <token start="4290" length="390">and</token>
      <token start="4680" length="190">say</token>
      <token start="4870" length="360">wooden</token>
      <token start="5230" length="370">desk</token>
      <token start="5600" length="440">with</token>
      <token start="6070" length="380">three</token>
      <token start="6450" length="630">drawers</token>
      <token start="7170" length="260">from</token>
      <token start="7430" length="130">the</token>
      <token start="7560" length="300">right</token>
      <token start="7860" length="470">side</token>
    </phrase>
    <phrase start="12868124101485" length="8380" rank="1">
      ...
    </phrase>
    ...
  </speech>
  <gaze>
    <gaze_fixation start="12868124100940" length="40" pos="568,364">
      <mesh prob="0.55">computer_monitor</mesh>
      <mesh prob="0.27">desk_75</mesh>
      <mesh prob="0.18">castle</mesh>
    </gaze_fixation>
    <gaze_fixation start="12868124101080" length="20" pos="771,375">
      <mesh prob="0.67">computer_body</mesh>
      <mesh prob="0.33">castle</mesh>
    </gaze_fixation>
    <gaze_fixation start="12868124101160" length="59" pos="772,409">
      <mesh prob="0.55">desk_75</mesh>
      <mesh prob="0.27">desk_drawer1</mesh>
      <mesh prob="0.18">castle</mesh>
    </gaze_fixation>
    ...
  </gaze>
</user_input>

In the above example,
<user_input>
  <speech utt_id="20081010-105510-601" matched="1">
    <transcript>it's a wooden desk with three drawers on the right side</transcript>
    <phrase start="12868124101485" length="8380">
      <token start="4290" length="390">it's</token>
      <token start="4680" length="190">a</token>
      <token start="4870" length="360">wooden</token>
      <token start="5230" length="370">desk</token>
      <token start="5600" length="440">with</token>
      <token start="6070" length="380">three</token>
      <token start="6450" length="630">drawers</token>
      <token start="7190" length="240">on</token>
      <token start="7430" length="130">the</token>
      <token start="7560" length="300">right</token>
      <token start="7860" length="390">side</token>
    </phrase>
  </speech>
  <gaze>
    <gaze_fixation start="12868124100940" length="40">
      <mesh prob="0.550000">computer_monitor</mesh>
      <mesh prob="0.270000">desk_75</mesh>
      <mesh prob="0.180000">castle</mesh>
    </gaze_fixation>
    <gaze_fixation start="12868124101080" length="20">
      <mesh prob="0.670000">computer_body</mesh>
      <mesh prob="0.330000">castle</mesh>
    </gaze_fixation>
    <gaze_fixation start="12868124101160" length="59">
      <mesh prob="0.550000">desk_75</mesh>
      <mesh prob="0.270000">desk_drawer1</mesh>
      <mesh prob="0.180000">castle</mesh>
    </gaze_fixation>
    ...
  </gaze>
</user_input>

In the above example,