Free Republic

For Missing Web Pages, a Department of Lost and Found
The New York Times | October 21, 2004 | ANNE EISENBERG

Posted on 10/21/2004 10:18:56 AM PDT by Ernest_at_the_Beach



October 21, 2004
WHAT'S NEXT

By ANNE EISENBERG

THE Web may be an information highway much of the time, but it turns into a cul-de-sac whenever a click leads to a message that the U.R.L. you requested could not be found.

Now a team of university students working as summer interns at an I.B.M. laboratory has devised a prototype for software designed to fix the problem of broken links. The tool checks links between pages to see if they are functioning properly, and if not, searches for the correct pages.

Andrew Flegg, a software engineer at the I.B.M. lab in Hursley, England, and a colleague first came up with a plan for the software, and then directed a team of four students who devised a working prototype.

Mr. Flegg, who used to run Hursley's internal Web site, said that he wanted to deal not only with broken links, but with those where content had become inaccurate or inappropriate.

To do this, the I.B.M. group developed a method for devising "fingerprints," or capsule descriptions of all the links on a site, so that they can be compared with any future version. "We basically look for what makes a document unique, extract those features and store them," Mr. Flegg said. "Then we can come back and compare the features with the ones we mapped before."

The program checks the fingerprints regularly to make sure that all the links at a Web site remain accurate. "It only needs to compare what's new with what you were happy with last time around," Mr. Flegg said.

If a link does not match the original fingerprint, which is based on the source code of the original page, its context and other information, the next problem the software must solve is determining the level of importance of the change.
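The article does not say which features Peridot extracts or how it scores a change, so the following is only a minimal sketch of the general idea of fingerprinting a page and then grading how far a later version has drifted from it. It assumes hashed three-word "shingles" as the features and Jaccard similarity as the comparison; the function names and the 0.5 threshold are hypothetical, chosen for illustration.

```python
import hashlib
import re


def fingerprint(text, k=3):
    """Return a set of hashed k-word shingles (an assumed feature choice)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return frozenset()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Store short digests rather than the text itself.
    return frozenset(
        hashlib.sha1(s.encode("utf-8")).hexdigest()[:16] for s in shingles
    )


def similarity(stored, current):
    """Jaccard similarity between two fingerprints, from 0.0 to 1.0."""
    if not stored and not current:
        return 1.0
    return len(stored & current) / len(stored | current)


def classify_change(stored, current, minor_threshold=0.5):
    """Grade a change as 'unchanged', 'minor' or 'major' (hypothetical scale)."""
    score = similarity(stored, current)
    if score == 1.0:
        return "unchanged"
    if score >= minor_threshold:
        return "minor"
    return "major"
```

Under these assumptions, a corrected spelling disturbs only the few shingles that contain the word and is graded "minor," while a page whose content has been replaced wholesale shares almost no shingles with the stored fingerprint and is graded "major." A site expected to vary constantly, such as a news page, would need a looser threshold.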

Ben P. Delo, who studies mathematics and computer science at Oxford University and was one of the four student interns on the team, said that some minor changes, like spelling, were easy to evaluate. Many other changes, too, that were inherent in the linked sites could be handled fairly simply. For example, a news site could be expected to vary constantly.

But other problems were more challenging, particularly when the program confronted a broken link and had to search among a huge number of pages for a replacement. "It was a chance to apply mathematical knowledge to solve a real problem," Mr. Delo said.
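The search step can be pictured as ranking candidate pages by how closely they resemble what the broken link used to point to. The sketch below is an assumption-laden illustration, not the team's algorithm: it compares plain word sets with Jaccard similarity and scans every candidate in turn, which would be far too slow at the scale the article describes. The names `find_replacement` and `candidates`, and the 0.6 threshold, are hypothetical.

```python
import re


def word_set(text):
    """Lower-cased set of the words in a page (an assumed, simplistic feature)."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_replacement(stored_text, candidates, threshold=0.6):
    """Return the URL of the most similar candidate page, or None.

    candidates maps URL -> page text. A linear scan like this one is only
    workable for small sites; it stands in for a far more efficient index.
    """
    target = word_set(stored_text)
    best_url, best_score = None, 0.0
    for url, text in candidates.items():
        score = jaccard(target, word_set(text))
        if score > best_score:
            best_url, best_score = url, score
    return best_url if best_score >= threshold else None
```

Returning None when nothing clears the threshold reflects the option of reporting a dead link rather than substituting a poor match.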

Mr. Delo was attracted by the difficulties of designing the complex algorithms to determine the degree of change in a link, the significance of this change and the best ways to search for the missing links. "We had to have extremely efficient algorithms to deal with the sheer size of the Internet," he said.

During the project, the group gradually solved one search problem after another. Their first prototype handled 10 pages, and then 100 pages. "That was fairly easy," Mr. Delo said. The milestone of 100,000 pages, the current limit, was harder. "We did a lot of tweaking and optimization to get a working system," he said, including experimenting with different ways of storing the data.

Users can decide which pages should be updated automatically and which substitutions require their notification.

I.B.M., which has filed for patents on the software, is deciding on future uses for the prototype. "We are evaluating ways of taking it forward," Mr. Flegg said.

For now, the software is not for sale. One day, though, it might be used by companies to monitor their internal Web sites for accuracy, saving time for Web administrators, who must handle the job manually. Internet service providers might also offer the software as a service to clients.

It might also help users of these sites. "At a big company with millions of documents on its internal Web site," Mr. Flegg said, "if the links are broken, employees can't do their job productively."

If the program grows as expected, it might also be offered as a service to check not only internal Web sites, but the entire Internet. "We believe the system can scale up to the size of the whole Internet with the appropriate computing power behind it," said Mr. Delo, who is now back at Oxford.

Mr. Delo said that one of the most interesting problems he encountered during his internship was in handling links where the U.R.L., the uniform resource locator or Web address, was correct, but the page had been radically changed. One example he cited occurred at the site of a fairly large company that provided links to many clients, one of which went bankrupt. "A pornography site bought out this bankrupt company," he said, and with it, the Web site. So people who clicked on the original link were exposed to pornography.

"You need a system to help manage this risk," he said, so that a company's reputation will not suffer when customers at its Web site are misdirected. "It could be the right U.R.L. but the wrong content," he said.

The I.B.M. software project is called Peridot, after a yellow-green gemstone that, it is said, once helped people find what they sought.

"Now users can find content by making sure links always point to the intended contents," said James Bell, a computer science student at the University of Warwick, another intern on the project. "Our tool will help them find information that might otherwise be lost."

The team also took advantage of the reputed inspirational powers of the stone, acquiring one and consulting it at tense moments during the project. "When we got stuck for ideas," Mr. Delo said, "we'd wave it around to think about solutions."

E-mail: Eisenberg@nytimes.com


TOPICS: Computers/Internet
KEYWORDS: links

1 posted on 10/21/2004 10:18:57 AM PDT by Ernest_at_the_Beach

