Michael Friis' Blog

About


Google sampled my voice and all I got was this lousy T-shirt!

I’ve just submitted a voice-sample to help Google in their efforts to build Danish-language voice search. See what voice search is about in this video. In case anyone is interested, here’s how Google goes about collecting these samples.
The sampling was carried out by a Danish-speaker hired by Google for the specific task. The sampling was done in a crowded Copenhagen coffee-shop (Baresso at Strøget, near Googles Copenhagen sales office) with people talking, coffee-machines hissing and music in the background. This is likely to ensure that samples are collected in an environment similar to the one where voice search will be used.

The samples were recorded on a stock Google Nexus One using an Android application called “dataHound”. The sampling basically involved me reading 500 random terms, presumably search terms harvested from Google searches. Most were one-word phrases but there some multi-word ones too (this likely reflects the fact that most users only search using single words). The Googler said that it was due to the sensitive nature of these terms (and the risk of harvesting presumably) that the sampling had to be carried out in-person. Google apparently requires 500 of these 500-word samples to form a language-corpus (I was number 50).

The dataHound app displayed the term to be spoken at the top with a bunch of buttons at the bottom. One button advanced the app to the next term, one could be pressed if the term was completely unintelligible and one could be used if the term was offensive to you and you did not want to say it out loud (I had no such qualms). The interface was pretty rough but the app was fast.

The terms were all over the place. I work for Ekstra Bladet (a Danish tabloid) and noted our name cropped up twice. “Nationen” (our debate sub-site) showed up once. Other Danish media sites were well represented and there were many locality-relevant searches. There were also a lot of domain-names, presumably Google expect people to use Google Voice Search over typing in a url themselves (indeed, people already do this on google.com).

Among the terms were also “Fisse” (the Danish word for “cunt”), “tisse kone” (a more polite synonym for female genitals), “ak-47” and “aluha ackbar”. If Google prompts you to say “cunt” in a public place, how can you refuse?

The googler told me that she’s looking for more volunteers, so drop her a line of you speak Danish and live in Copenhagen: [email protected]. Plus, you get a Google T-shirt for your efforts!