Free Republic

More Thoughts on Google and "Trustworthiness"
Evolution News and Views ^ | March 9, 2015 | Erik J. Larson

Posted on 03/10/2015 5:25:51 AM PDT by Heartlander


Last week, I commented on the proposal by a Google team to introduce "trustworthiness" as a criterion in Web searches. See "In Determining Truthfulness, a Google Team Would Like to Do Your Thinking for You." I noted the potential for bias that this "fix" to existing algorithms would bring with it -- a worrisome prospect, perhaps, for Web sources not in line with the politically progressive perspective favored by many Google employees.

How worrisome, exactly? For that we need to wade into some of the technical aspects of the paper the team used to advance their idea.

If you read the paper, you find that their group is working on extracting "triples" of the form <Subject, Predicate, Object>. These simple structures are the basis for the facts the Google team wants to identify on the Web. To obtain these triples from source documents, you'd typically use something called "Named Entity Recognition" (NER). NER simply scans text for entities of certain types, like Person or Country. The full triple extractor will thus use NER as a subroutine, to "bind" the arguments in the triple to appropriately typed objects extracted from the Web pages. So: match Subject to "Obama," Predicate to "president of," and Object to "United States," where Obama is an instance of Person, United States is an instance of Country, and the Predicate can either be stand-alone or a specialization of something like "leader of." None of this is very groundbreaking. There are huge databases that were developed to specify all the vocabulary and constraints on semantic structures like triples (cf. FrameNet).
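To make the pipeline concrete, here is a minimal sketch in Python -- my own illustration, not Google's code -- of a toy NER lexicon feeding a pattern-based triple extractor. The entity lexicon and the predicate pattern are both assumptions chosen to match the paper's example:

```python
import re

# Toy "NER": map surface strings to entity types (illustrative only).
ENTITY_TYPES = {
    "Obama": "Person",
    "United States": "Country",
}

# Predicate patterns with typed argument slots (illustrative only).
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) is president of the ([\w ]+)"),
     "president of", ("Person", "Country")),
]

def extract_triples(sentence):
    """Return <Subject, Predicate, Object> triples whose arguments
    pass the NER type check."""
    triples = []
    for pattern, predicate, (subj_type, obj_type) in PATTERNS:
        for m in pattern.finditer(sentence):
            subj, obj = m.group(1).strip(), m.group(2).strip()
            # "Bind" the arguments: accept only appropriately typed entities.
            if (ENTITY_TYPES.get(subj) == subj_type
                    and ENTITY_TYPES.get(obj) == obj_type):
                triples.append((subj, predicate, obj))
    return triples

print(extract_triples("Obama is president of the United States"))
# [('Obama', 'president of', 'United States')]
```

A real system would replace the hand-built lexicon and regex with statistical NER and learned extraction patterns, but the binding step is the same in spirit.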

Extracting triples is semantic and discourse-constrained, in the sense that language issues like indexicality (anaphora), ambiguity, ellipsis, presupposition, metaphor, and many other issues will place fairly strict limits on the general accuracy of triple extraction systems, unless the domain is severely restricted. The Web is not -- it's no wonder this is research for Google.

The example in the paper is the one I just gave: "Obama is president of the United States." They combine this example with: "Obama's nationality is {United States, Kenya}." Without wanting to read too much into their research, this is surely a political example. What likely motivates the research group, however, is making a substantive gain on a very difficult problem. Examples are typically chosen so that (a) the system has a prayer of working on them, and (b) they are of general interest, in that order.
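The nationality example is also useful for seeing the two kinds of error such a system has to pull apart: the source page may assert a falsehood, or the extractor may misread the page. A toy sketch of the distinction (my own illustration, not the paper's actual estimation model):

```python
# Assumed gold knowledge for illustration.
KNOWLEDGE_BASE = {("Obama", "nationality", "United States")}

def classify(extracted, page_asserts):
    """Compare the extractor's output against what the page actually
    says and against the knowledge base."""
    extraction_ok = (extracted == page_asserts)
    fact_true = extracted in KNOWLEDGE_BASE
    if extraction_ok and not fact_true:
        return "source error"      # page is wrong; extractor was faithful
    if not extraction_ok:
        return "extraction error"  # extractor misread the page
    return "correct"

# Page claims Kenyan nationality; extractor reads it faithfully.
print(classify(("Obama", "nationality", "Kenya"),
               ("Obama", "nationality", "Kenya")))          # source error

# Page states the true fact; extractor garbles the object.
print(classify(("Obama", "nationality", "Kenya"),
               ("Obama", "nationality", "United States")))  # extraction error
```

Only the first kind of error says anything about the page's trustworthiness; the second says something about the extractor, and conflating the two would penalize honest sources.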

The work is fairly interesting. They use estimation models to separate two sources of error: correctly extracted false triples ("Obama's nationality is Kenyan," correctly extracted from the source page but false) and incorrectly extracted true triples (the same example, but here the extractor made the error, not the Web page author). The bigger issue, however, is that the system can't possibly move the ball forward on facticity issues on the Web. Only very restricted domains and very simplistic, semi-structured writing would bridge the research here to a bona fide signal on trustworthiness. I'll give a few examples below. Let's say an author writes this on his blog:

In what follows I'll discuss a few malicious myths prevalent on the Web.

(1) Eric Schmidt of Google has a PhD.

(2) Larry Page went to Stanford.

(3) Terry Winograd was Larry Page's advisor at Stanford.

All three of these true statements are negated by the author's preamble: by the noun "myth," of course, and by the modifier "malicious," the latter suggesting that the blogger is decidedly not trustworthy at all (an emotive response to true propositions). But the information extraction approach in the Google paper wouldn't resolve the correct truth-values. In fact, even given a high-accuracy extraction of (1)-(3) by their system, the statements would get tagged as factual -- assuming Google's Knowledge Vault contains these facts (a good assumption). To get this right, language analysis would have to tackle discourse issues, but the triple extraction system clearly can't do this. This is no fault of Google -- the issues are unresolved in computer science, and more specifically in natural language understanding generally. Thornier examples:

(1) Cameron and Tyler Winklevoss are rumored to have really started Facebook.

(2) Cameron and Tyler Winklevoss are known to have really started Facebook.

The two sentences require fairly sophisticated language analysis. The first makes a silly claim, the second a flatly false one. A lexical analysis of "rumored" and "known" is needed, but this quickly blows up into the question of whether the context of "rumored" is in fact true -- in the same sense that if I claim that "John believes that the moon is made of green cheese," and John really does believe this (he's just wrong), I have made a true statement even though the embedded proposition is false. It gets trickier still. For example:

(1) David Cameron is Prime Minister of Great Britain.

(2) Cameron and Tyler Winklevoss are rumored to have really started Facebook.

(3) Cameron and Tyler Winklevoss are known to have started Facebook.

Here, (1) is obviously true. But putting it in the context of (2) and (3) affects the evaluation of their facticity: the surface string "Cameron" names a different person in each sentence (David Cameron, the Prime Minister, versus Cameron Winklevoss), and resolving it correctly requires context beyond the sentence itself.
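A crude sketch of why this trips up an extractor -- the cue-word disambiguation rule below is an illustrative assumption, not a real coreference algorithm:

```python
# The same surface string maps to multiple candidate entities.
CANDIDATES = {
    "Cameron": ["David Cameron", "Cameron Winklevoss"],
}

# Hand-picked context cues per candidate (an assumption for illustration).
CUES = {
    "David Cameron": {"prime", "minister", "britain"},
    "Cameron Winklevoss": {"tyler", "winklevoss", "facebook"},
}

def resolve(name, context_words):
    """Pick the candidate entity sharing a cue with the sentence's
    other words; return None when no cue matches."""
    context = {w.lower() for w in context_words}
    for candidate in CANDIDATES.get(name, []):
        if CUES[candidate] & context:
            return candidate
    return None

print(resolve("Cameron", ["Prime", "Minister", "of", "Great", "Britain"]))
# David Cameron
print(resolve("Cameron", ["and", "Tyler", "Winklevoss", "started", "Facebook"]))
# Cameron Winklevoss
```

A greedy resolver that commits to one entity for "Cameron" across all three sentences would corrupt the facticity judgment of at least one of them.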

On and on it goes. The subset of facts that can be reliably extracted will in general be too small to provide a meaningful signal for search. There is a kind of event horizon beyond which, as a researcher, you get lost in the endless jungles of language; systems then suffer either a major degradation of performance or a severe restriction of domain (certainly not the Web). This horizon usually appears when indexicality enters in -- in other words, when the context of an utterance matters to resolving its meaning.
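The blog "myths" example above sits exactly at this horizon: a sentence-level tagger checks each statement against its knowledge base and never sees the preamble. A sketch of the gap, with an assumed cue-word "fix" included mainly to show how crude such patches are:

```python
# Assumed vault contents for illustration.
KNOWLEDGE_VAULT = {
    "Larry Page went to Stanford",
    "Terry Winograd was Larry Page's advisor at Stanford",
}

NEGATING_CUES = ("myth", "falsehood", "lie")

def naive_tag(statement):
    # Sentence-level check: true statements in the vault get tagged factual,
    # regardless of the discourse they appear in.
    return "factual" if statement in KNOWLEDGE_VAULT else "unknown"

def discourse_tag(preamble, statement):
    # Crude discourse awareness: a negating cue in the preamble flips
    # the author's asserted polarity -- even over true statements.
    tag = naive_tag(statement)
    if any(cue in preamble.lower() for cue in NEGATING_CUES):
        return "author-negated (" + tag + ")"
    return tag

preamble = "In what follows I'll discuss a few malicious myths prevalent on the Web."
print(naive_tag("Larry Page went to Stanford"))               # factual
print(discourse_tag(preamble, "Larry Page went to Stanford")) # author-negated (factual)
```

The keyword list is hopeless in general -- "myth" can be mentioned rather than used, negated, or quoted -- which is the point: discourse-level polarity is not recoverable from a word list.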

Unfortunately, as Yehoshua Bar-Hillel pointed out decades ago, most of language is indexical. This is why I've been saying that the dirty little secret in language understanding or AI is that the really interesting problems are seen and understood only by minds. The more one understands the depth of the issues, the more the impossibility of computing certain results becomes apparent, and the more interesting and mysterious is our own connection to language. Researchers, of course, have to do research, and that takes funding. So the scope of the problems tends not to get discussed. Yet the simplifications can often involve troubling effects on all of us, even resulting in what amounts to censorship.

A final example:

Intelligent Design is a rival theory to Darwinism.

Is this factual? If you ask Steve Meyer? If you ask Richard Dawkins?

What counts as a fact in the first place? Google says that their Knowledge Vault (a creepy term, I think, where the world's knowledge gets trapped in a corporate "vault") represents a consensus of facts culled from the Web. Is there meaningful consensus on issues where a passionate minority rejects a majority view, and maybe for very good reasons? The vault will certainly not be kind to this minority. The whole point of debate is not to place facts in a locked and armored strongroom, after all, where the light of further discussion, exploration, and discovery cannot reach them. The "facts" may be updated when the consensus changes. But in the meantime, the trustworthiness score of minority viewpoints would be low.

That scares me, as it should every reader interested in the ongoing project of reasoned debate, and in freedom from censorship. The Web, in very subtle ways, does indeed breed censorship. Or at least it certainly can. So I ask again: What counts as a fact? And to whom?



TOPICS: Education; Reference; Society

1 posted on 03/10/2015 5:25:51 AM PDT by Heartlander

To: Heartlander
So I ask again: What counts as a fact? And to whom?

Is a Leftist speaking, and does it advance the agenda?

They have poisoned the well to the point where no one believes anything anymore.

2 posted on 03/10/2015 5:30:59 AM PDT by Old Sarge (Its the Sixties all over again, but with crappy music...)

To: Heartlander

Question: Exactly WHO is it that vets Google for “trustworthiness”? Is it some geek at Google Central who uses his or her politically correct knowledge base to verify this? Think of Marie Harf as one of these “guardians”...and cringe.


3 posted on 03/10/2015 5:36:02 AM PDT by MasterGunner01

To: Heartlander

Google has already changed their algorithms. When I search for some things, where once there were many relevant choices, now there is very little. I used to run a simple search for some pyrotechnic formulas; before, I got real data, and now I don’t.

Some other “standard” searches I perform are also very different now. I suspect it has little to do with trustworthiness or truth, and more to do with what Google does to protect the Chinese political establishment.


4 posted on 03/10/2015 5:52:17 AM PDT by DBrow

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson