Free Republic

More Thoughts on Google and "Trustworthiness"
Evolution News and Views ^ | March 9, 2015 | Erik J. Larson

Posted on 03/10/2015 5:25:51 AM PDT by Heartlander


Last week, I commented on the proposal by a Google team to introduce "trustworthiness" as a criterion in Web searches. See "In Determining Truthfulness, a Google Team Would Like to Do Your Thinking for You." I noted the potential for bias that this "fix" to existing algorithms would bring with it -- a worrisome prospect, perhaps, for Web sources not in line with the politically progressive perspective favored by many Google employees.

How worrisome, exactly? For that we need to wade into some of the technical aspects of the paper the team used to advance their idea.

If you read the paper, you find that their group is working on extracting "triples" of the form <Subject, Predicate, Object>. These simple structures are the basis for the facts the Google team wants to identify on the Web. To obtain these triples from source documents, you'd typically use something called "Named Entity Recognition" (NER). NER simply scans text for entities of certain types, like Person or Country. The full triple extractor will thus use NER as a subroutine, to "bind" the arguments in the triple to appropriately typed objects extracted from the Web pages. So: match Subject to "Obama," Predicate to "president of," and Object to "United States," where Obama is an instance of Person, United States is an instance of Country, and the Predicate can either be stand-alone or a specialization of something like "leader of." None of this is very groundbreaking. There are huge databases that were developed to specify all the vocabulary and constraints on semantic structures like triples (cf. FrameNet).
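To make the pipeline concrete, here is a minimal sketch in Python -- my own illustration, not Google's code -- of a toy NER lexicon feeding a pattern-based triple extractor. The entity lexicon and the predicate pattern are both assumptions chosen to match the paper's example:

```python
import re

# Toy "NER": map surface strings to entity types (illustrative only).
ENTITY_TYPES = {
    "Obama": "Person",
    "United States": "Country",
}

# Predicate patterns with typed argument slots (illustrative only).
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) is president of the ([\w ]+)"),
     "president of", ("Person", "Country")),
]

def extract_triples(sentence):
    """Return <Subject, Predicate, Object> triples whose arguments
    pass the NER type check."""
    triples = []
    for pattern, predicate, (subj_type, obj_type) in PATTERNS:
        for m in pattern.finditer(sentence):
            subj, obj = m.group(1).strip(), m.group(2).strip()
            # "Bind" the arguments: accept only appropriately typed entities.
            if (ENTITY_TYPES.get(subj) == subj_type
                    and ENTITY_TYPES.get(obj) == obj_type):
                triples.append((subj, predicate, obj))
    return triples

print(extract_triples("Obama is president of the United States"))
# [('Obama', 'president of', 'United States')]
```

A real system would replace the hand-built lexicon and regex with statistical NER and learned extraction patterns, but the binding step is the same in spirit.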

Extracting triples is semantic and discourse-constrained, in the sense that language issues like indexicality (anaphora), ambiguity, ellipsis, presupposition, metaphor, and many other issues will place fairly strict limits on the general accuracy of triple extraction systems, unless the domain is severely restricted. The Web is not -- it's no wonder this is research for Google.

The example in the paper is the one I just gave: "Obama is president of the United States." They combine this example with: "Obama's nationality is {United States, Kenya}." Without wanting to read too much into their research, this is surely a political example. What likely motivates the research group, however, is making a substantive gain on a very difficult problem. Examples are typically chosen so that (a) the system has a prayer of working on them, and (b) they are of general interest, in that order.
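The nationality example is also useful for seeing the two kinds of error such a system has to pull apart: the source page may assert a falsehood, or the extractor may misread the page. A toy sketch of the distinction (my own illustration, not the paper's actual estimation model):

```python
# Assumed gold knowledge for illustration.
KNOWLEDGE_BASE = {("Obama", "nationality", "United States")}

def classify(extracted, page_asserts):
    """Compare the extractor's output against what the page actually
    says and against the knowledge base."""
    extraction_ok = (extracted == page_asserts)
    fact_true = extracted in KNOWLEDGE_BASE
    if extraction_ok and not fact_true:
        return "source error"      # page is wrong; extractor was faithful
    if not extraction_ok:
        return "extraction error"  # extractor misread the page
    return "correct"

# Page claims Kenyan nationality; extractor reads it faithfully.
print(classify(("Obama", "nationality", "Kenya"),
               ("Obama", "nationality", "Kenya")))          # source error

# Page states the true fact; extractor garbles the object.
print(classify(("Obama", "nationality", "Kenya"),
               ("Obama", "nationality", "United States")))  # extraction error
```

Only the first kind of error says anything about the page's trustworthiness; the second says something about the extractor, and conflating the two would penalize honest sources.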

The work is fairly interesting. They use estimation models to separate two sources of error: correctly extracted false triples ("Obama's nationality is Kenyan," correctly extracted from the source page but false) and incorrectly extracted true triples (the same example, but here the extractor made the error, not the Web page author). The bigger issue, however, is that the system can't possibly move the ball forward on facticity issues on the Web. Only very restricted domains and very simplistic, semi-structured writing would bridge the research here to a bona fide signal on trustworthiness. I'll give a few examples below. Let's say an author writes this on his blog:

In what follows I'll discuss a few malicious myths prevalent on the Web.

(1) Eric Schmidt of Google has a PhD.

(2) Larry Page went to Stanford.

(3) Terry Winograd was Larry Page's advisor at Stanford.

All three of these true statements are negated by the author's preamble: by the noun "myth," of course, and by the modifier "malicious," the latter suggesting that the blogger is decidedly not trustworthy at all (an emotive response to true propositions). But the information extraction approach in the Google paper wouldn't resolve the correct truth-values. In fact, even given a high-accuracy extraction of (1)-(3) by their system, the statements would get tagged as factual -- assuming Google's Knowledge Vault contains these facts (a good assumption). To get this right, language analysis would have to tackle discourse issues, but the triple extraction system clearly can't do this. This is no fault of Google -- the issues are unresolved in computer science, and more specifically in natural language understanding generally. Thornier examples:

(1) Cameron and Tyler Winklevoss are rumored to have really started Facebook.

(2) Cameron and Tyler Winklevoss are known to have really started Facebook.

The two sentences require fairly sophisticated language analysis. The first makes a silly claim, the second a flatly false one. A lexical analysis of "rumored" and "known" is needed, but this quickly blows up into the question of whether the context of "rumored" is in fact true -- in the same sense that if I claim that "John believes that the moon is made of green cheese," and John really does believe this (he's just wrong), I have made a true statement even though the embedded proposition is false. It gets trickier still. For example:

(1) David Cameron is Prime Minister of Great Britain.

(2) Cameron and Tyler Winklevoss are rumored to have really started Facebook.

(3) Cameron and Tyler Winklevoss are known to have started Facebook.

Here, (1) is obviously true. But putting it in the context of (2) and (3) affects the evaluation of their facticity: the surface string "Cameron" names a different person in each sentence (David Cameron, the Prime Minister, versus Cameron Winklevoss), and resolving it correctly requires context beyond the sentence itself.
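A crude sketch of why this trips up an extractor -- the cue-word disambiguation rule below is an illustrative assumption, not a real coreference algorithm:

```python
# The same surface string maps to multiple candidate entities.
CANDIDATES = {
    "Cameron": ["David Cameron", "Cameron Winklevoss"],
}

# Hand-picked context cues per candidate (an assumption for illustration).
CUES = {
    "David Cameron": {"prime", "minister", "britain"},
    "Cameron Winklevoss": {"tyler", "winklevoss", "facebook"},
}

def resolve(name, context_words):
    """Pick the candidate entity sharing a cue with the sentence's
    other words; return None when no cue matches."""
    context = {w.lower() for w in context_words}
    for candidate in CANDIDATES.get(name, []):
        if CUES[candidate] & context:
            return candidate
    return None

print(resolve("Cameron", ["Prime", "Minister", "of", "Great", "Britain"]))
# David Cameron
print(resolve("Cameron", ["and", "Tyler", "Winklevoss", "started", "Facebook"]))
# Cameron Winklevoss
```

A greedy resolver that commits to one entity for "Cameron" across all three sentences would corrupt the facticity judgment of at least one of them.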

On and on it goes. The subset of facts that can be reliably extracted will in general be too small to provide a meaningful signal for search. There is a kind of event horizon beyond which, as a researcher, you get lost in the endless jungles of language; systems then suffer either a major degradation of performance or a severe restriction of domain (certainly not the Web). This horizon usually appears when indexicality enters in -- in other words, when the context of an utterance matters to resolving its meaning.
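The blog "myths" example above sits exactly at this horizon: a sentence-level tagger checks each statement against its knowledge base and never sees the preamble. A sketch of the gap, with an assumed cue-word "fix" included mainly to show how crude such patches are:

```python
# Assumed vault contents for illustration.
KNOWLEDGE_VAULT = {
    "Larry Page went to Stanford",
    "Terry Winograd was Larry Page's advisor at Stanford",
}

NEGATING_CUES = ("myth", "falsehood", "lie")

def naive_tag(statement):
    # Sentence-level check: true statements in the vault get tagged factual,
    # regardless of the discourse they appear in.
    return "factual" if statement in KNOWLEDGE_VAULT else "unknown"

def discourse_tag(preamble, statement):
    # Crude discourse awareness: a negating cue in the preamble flips
    # the author's asserted polarity -- even over true statements.
    tag = naive_tag(statement)
    if any(cue in preamble.lower() for cue in NEGATING_CUES):
        return "author-negated (" + tag + ")"
    return tag

preamble = "In what follows I'll discuss a few malicious myths prevalent on the Web."
print(naive_tag("Larry Page went to Stanford"))               # factual
print(discourse_tag(preamble, "Larry Page went to Stanford")) # author-negated (factual)
```

The keyword list is hopeless in general -- "myth" can be mentioned rather than used, negated, or quoted -- which is the point: discourse-level polarity is not recoverable from a word list.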

Unfortunately, as Yehoshua Bar-Hillel pointed out decades ago, most of language is indexical. This is why I've been saying that the dirty little secret in language understanding or AI is that the really interesting problems are seen and understood only by minds. The more one understands the depth of the issues, the more the impossibility of computing certain results becomes apparent, and the more interesting and mysterious is our own connection to language. Researchers, of course, have to do research, and that takes funding. So the scope of the problems tends not to get discussed. Yet the simplifications can often involve troubling effects on all of us, even resulting in what amounts to censorship.

A final example:

Intelligent Design is a rival theory to Darwinism.

Is this factual? If you ask Steve Meyer? If you ask Richard Dawkins?

What counts as a fact in the first place? Google says that their Knowledge Vault (a creepy term, I think, where the world's knowledge gets trapped in a corporate "vault") represents a consensus of facts culled from the Web. Is there meaningful consensus on issues where a passionate minority rejects a majority view, and maybe for very good reasons? The vault will certainly not be kind to this minority. The whole point of debate is not to place facts in a locked and armored strongroom, after all, where the light of further discussion, exploration, and discovery cannot reach them. The "facts" may be updated when the consensus changes. But in the meantime, the trustworthiness score of minority viewpoints would be low.

That scares me, as it should every reader interested in the ongoing project of reasoned debate, and in freedom from censorship. The Web, in very subtle ways, does indeed breed censorship. Or at least it certainly can. So I ask again: What counts as a fact? And to whom?



TOPICS: Education; Reference; Society

1 posted on 03/10/2015 5:25:51 AM PDT by Heartlander

To: Heartlander
So I ask again: What counts as a fact? And to whom?

Is a Leftist speaking, and does it advance the agenda?

They have poisoned the well to the point where no one believes anything anymore.

2 posted on 03/10/2015 5:30:59 AM PDT by Old Sarge (Its the Sixties all over again, but with crappy music...)

To: Heartlander

Question: Exactly WHO is it that vets Google for “trustworthiness”? Is it some geek at Google Central who uses his or her politically correct knowledge base to verify this? Think of Marie Harf as one of these “guardians”...and cringe.


3 posted on 03/10/2015 5:36:02 AM PDT by MasterGunner01

To: Heartlander

Google has already changed their algorithms. When I search for some things, where once there were many relevant choices, now there is very little. I used to run a simple search for some pyrotechnic formulas; before, I got real data, and now I don’t.

Some other “standard” searches I perform are also very different now. I suspect it has little to do with trustworthiness or truth, and more to do with what Google does to protect the Chinese political establishment.


4 posted on 03/10/2015 5:52:17 AM PDT by DBrow

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson