Leave it to a computer to rub our noses in the nitty-gritty of semantics. Mostly, as we go about our day, we toss out words and sentences to our fellow speakers and communication takes place, more or less. Building that rich linguistic environment into a machine is notoriously difficult and fraught with unintended consequences. I'll share a few such examples from the domain of morphology, the word-building component of language, drawn from one real natural language understanding system.

This system is no word-stemming toy that merely lops the ends off words and hopes something useful is left to work with. No, it has a full dictionary-sized vocabulary, with representations for part of speech (am I a noun, a verb, an adjective, or what?), and a rich set of affixes (prefixes and suffixes, in English) along with their meanings, sometimes multiple meanings for a given form. For practical work, the system can be shown language samples and simply match the tokens in the sample against forms in the dictionary. If a token matches, we have its semantics, the meaning we need to understand it. (Let's set aside ambiguous meanings for the moment; they can muddy the waters considerably.)
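(If you like to see things in code: here is a minimal Python sketch of that dictionary-matching step. The toy lexicon and the `lookup` helper are my own illustrative inventions, not the actual system's internals.)

```python
# Hypothetical toy lexicon: token -> (part of speech, gloss).
LEXICON = {
    "caress": ("verb", "to touch gently"),
    "pigeon": ("noun", "a stout-bodied bird"),
    "fig":    ("noun", "a soft, sweet fruit"),
}

def lookup(token):
    """Return (pos, gloss) if the token is a known dictionary form, else None."""
    return LEXICON.get(token.lower())

for token in "The pigeon ate a fig".split():
    entry = lookup(token)
    print(token, "->", entry if entry else "unknown")
```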
But suppose you want to explore what it takes for a computer to learn new words that are not already in its dictionary, based on what it already knows. One experiment would be to 'turn off' access to the dictionary and let the computer work with its word-formation rules alone. This is interesting because it sheds light on how robust and accurate the system's existing word-formation rules are, and where modifications might improve its linguistic abilities. But another interesting result of such an experiment is that it highlights for us, the gold standard for our own human languages, some very plausible analyses of our own semantics that we have probably never considered, until the alien mind of a cyber being churns them out from the only rules we have given it.
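To make the experiment concrete, here is a rough Python sketch of what rule-only analysis might look like. The suffix table, the stem list, and the recognizer are simplified stand-ins of my own devising; the real system's word-formation rules are far richer (it handles compounds and prefixes too, for instance).

```python
# Hypothetical suffix rules: suffix -> (meaning, part of speech produced).
SUFFIXES = {
    "ess": ("feminine of the base noun", "noun"),
    "ry":  ("having to do with the base noun", "noun"),
    "ite": ("kind of person", "noun"),
    "ier": ("comparative adjective form", "adjective"),
}

# A stand-in for "known stems". With the dictionary off, the real system
# would instead fall back on its own word-formation rules recursively.
KNOWN_STEMS = {"car", "infant", "peridot", "pigeon", "cash"}

def analyze(word):
    """Try to peel a known suffix off the word; return all rule-based analyses."""
    analyses = []
    for suffix, (meaning, pos) in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if stem in KNOWN_STEMS:
                analyses.append((stem, "-" + suffix, meaning, pos))
    return analyses

for word in ["caress", "infantry", "pigeonite", "cashier"]:
    print(word, "->", analyze(word))
# e.g. caress -> [('car', '-ess', 'feminine of the base noun', 'noun')]
```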
Consider the following, which are actual analyses produced by the above-mentioned natural language understanding system with its dictionary 'off' and only its word-formation rules available. (I should mention that it also succeeded at many, many accurate word analyses.) The sentences are my effort to put the 'new words' into context. 🙂
caress –> car (noun) + -ess (feminine suffix): 'a female car'
The tanks on the left at the gas station are for cars; the ones on the right are for caresses.

infantry –> infant (noun) + -ry (having to do with): 'all things infantile'
The infantry of their behavior after losing the game was ridiculous.

peridotite –> peridot (noun) + -ite (kind of person): 'a person made of olivine'
Last night we watched 'Invasion of the Peridotites' at the drive-in theater.

figurine –> fig (noun) + urine (noun), compound: 'fig excretion'
When the fruit becomes overripe it contains toxic levels of figurine.

address –> adder (noun; the system inferred 'add', then 'adder') + -ess (that feminine suffix again!): 'a female adder'
The addresses outperformed the adders on four of the six general ledger tasks.

pigeon –> pig (noun) + eon (noun), compound: 'the age of peccaries'
The pigeon was characterized by lots of foraging and wallowing in the mud.

pigeonite –> pigeon (noun this time!) + -ite (kind of person): 'a pigeon-person'
Lately, the pigeonites have been crowding out the pigeons under the bridges in the park.
Note: the system knows two meanings for -ite, 'mineral' (anthracite, pigeonite) and 'kind of person' (Hittite, socialite), but was favoring the second meaning in its analyses.
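In code terms, one plausible way to represent that ambiguity is to give each suffix a ranked list of senses and let the analyzer take the best-ranked one. This is my own guess at a mechanism, not the actual system's bookkeeping:

```python
# Hypothetical: each suffix carries several senses plus a preference rank.
# The note above suggests -ite's 'kind of person' sense was winning,
# so it gets the higher rank (lower number) here.
SUFFIX_SENSES = {
    "ite": [
        {"gloss": "mineral",        "examples": ["anthracite", "pigeonite"], "rank": 2},
        {"gloss": "kind of person", "examples": ["Hittite", "socialite"],    "rank": 1},
    ],
}

def preferred_sense(suffix):
    """Return the gloss of the highest-ranked sense for a suffix."""
    return min(SUFFIX_SENSES[suffix], key=lambda s: s["rank"])["gloss"]

print(preferred_sense("ite"))  # kind of person
```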
Isn’t it kind of amazing that we don’t stumble over our own semantic shoelaces more often?
Update: Okay, one more.

cashier –> cash (noun) + -ier (comparative adjective form): 'possessing more cash' (I was wrong to infer 'more expensive')
Mary's new boyfriend is cashier than her former one.