I was not clear enough in my question. I did not mean an index as it relates to a database, but an index in the classic sense of a book index.
Have you ever created a human-readable index?
An index for a book is not a concordance. A good index is more useful, but it must be created with human intuition, not by software alone. A good index often has cross-references and double-posting of synonymous terms.
What I was trying to describe was done to index items that had no real category system in place. The index was created after the intuitive arrangement of the items in the catalog, by paging through the printed copy and creating index entries from 30 years' knowledge of the industry. I used trade terms, slang terms, and vendor names where it made sense to use them.
Often catalogs have what looks like an index but really is not one. Software processes text efficiently, but its output is often not very human-readable.
For a long time my tag line was: “Bringing order to apparent chaos is the highest form of creativity.”
If you take the naive approach (index every lexical word except for stop words like "the" and "and"), that you can do with a short program in a language like Ruby or Python. But you will indeed end up with something like a concordance: simultaneously too much information and not enough. To turn it into a useful index, you would need to work it over manually, removing most of the entries and regrouping and recharacterizing others.
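To make the naive approach concrete, here is a rough Python sketch of it. Everything in it is an assumption for illustration: the function name build_concordance, the tiny stop list, and the idea that the text arrives as one string per page. It produces exactly the kind of raw concordance described above, not a usable index.

```python
import re
from collections import defaultdict

# Illustrative stop list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def build_concordance(pages):
    """Map each non-stop word to the sorted page numbers it appears on.

    `pages` is a list of page texts; page numbers start at 1.
    """
    entries = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        # Lowercase words, allowing one internal apostrophe (e.g. "don't").
        for word in re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()):
            if word not in STOP_WORDS:
                entries[word].add(page_number)
    # Alphabetical order, as in a printed back-of-book listing.
    return {word: sorted(nums) for word, nums in sorted(entries.items())}


if __name__ == "__main__":
    sample_pages = [
        "The index of a book is not a concordance.",
        "A concordance lists every word in the book.",
    ]
    for word, page_numbers in build_concordance(sample_pages).items():
        print(f"{word}: {', '.join(map(str, page_numbers))}")
```

Even on two sentences the output includes entries like "is" and "not" that nobody would look up, and it has no cross-references or synonyms, which is where the manual work comes in.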
A useful book index is probably beyond the limits of artificial intelligence as it exists today. Maybe Google could take a stab at it. You could submit a manuscript, and Google would shoot it back with all the interesting words and phrases marked, perhaps with interesting categories suggested. Interesting in the light of their global document sea, that is. That would be a start, but you'd still need to review it and add, subtract, and categorize entries.
I just fired up Word, which I hardly ever use (I have Office for the Mac, and I use Excel and sometimes PowerPoint, but hardly ever Word, except to read the occasional Word document). I see it has what looks like a fairly intelligent indexing facility. You highlight the word or phrase, and a pre-filled form pops up. You can hit Return, or you can modify the category, subcategory, etc. So, maybe my hypothetical Google facility could accept Word documents and return them with suggested index markup added in Word format, which allows you to hide the markup when working on other aspects of the document.
Indexing a deeds registry or a vital records office is a much simpler process, conceptually at least. You just need to add the contents of certain well-defined fields to a database, hopefully associated with a high-quality image of the original document. For a birth record, that means baby's name, parents' names, hospital, city and state, and date and time, with unusual circumstances, such as a non-hospital birth or late filing, noted. You don't really need parents' races, but, hey, why not (even if the values are self-reported). Once you've got the database, producing a short form is trivial. And so is producing the long form, if you scanned the originals.
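For what it's worth, the records case really is only a few lines. The sketch below uses Python's built-in sqlite3 module; the table name, column names, helper functions, and the sample record are all made up for illustration and are not taken from any real registry system.

```python
import sqlite3


def create_registry(conn):
    """Create a hypothetical birth-records table plus a name lookup index."""
    # hospital is NULL for a non-hospital birth; notes holds things like
    # "late filing"; image_path points at the scan of the original document.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS birth_records (
            id INTEGER PRIMARY KEY,
            child_name TEXT NOT NULL,
            parent_names TEXT NOT NULL,
            hospital TEXT,
            city TEXT NOT NULL,
            state TEXT NOT NULL,
            born_at TEXT NOT NULL,
            notes TEXT,
            image_path TEXT
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_child_name "
        "ON birth_records (child_name)"
    )


def add_record(conn, record):
    """Insert one record given as a dict keyed by column name."""
    cursor = conn.execute(
        """
        INSERT INTO birth_records
            (child_name, parent_names, hospital, city, state,
             born_at, notes, image_path)
        VALUES
            (:child_name, :parent_names, :hospital, :city, :state,
             :born_at, :notes, :image_path)
        """,
        record,
    )
    return cursor.lastrowid


def short_form(conn, record_id):
    """Return a one-line summary of a record, or None if it does not exist."""
    row = conn.execute(
        "SELECT child_name, city, state, born_at "
        "FROM birth_records WHERE id = ?",
        (record_id,),
    ).fetchone()
    if row is None:
        return None
    child_name, city, state, born_at = row
    return f"{child_name}, born {born_at} in {city}, {state}"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_registry(conn)
    # Entirely fictional sample data.
    record_id = add_record(conn, {
        "child_name": "Jane Example",
        "parent_names": "Pat Example; Sam Example",
        "hospital": None,
        "city": "Springfield",
        "state": "XX",
        "born_at": "2000-01-01T12:00",
        "notes": "non-hospital birth",
        "image_path": "scans/0001.png",
    })
    print(short_form(conn, record_id))
```

The long form would just be the stored image at image_path, which is why scanning the originals matters.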