Posted on 02/23/2005 12:16:01 PM PST by Technocrat
I have created a new search technology that will make it much easier to find related articles and avoid duplicates. It is NOT keyword-based - instead, it uses sense and context to determine what an article is actually about, both in major and minor themes. I had about 500 FR articles lying around, so I shoved them into the indexer to see how well it would do, and the results are noteworthy :)
Some notes: everything is in lower case to speed indexing. No content is actually stored in the database except the title, URL, and a 250-character snippet of internal text. Instead, semantic compression reduces the entire contents of the article and its first three replies to a 150-dimensional vector in domain space. As a result, comparisons are blindingly fast (although initial indexing takes about half a second on my creaky old 1999 box). Also, to respect the robots policy I can't automatically retrieve text from FR, so for now you have to copy and paste it from the articles you want to match.
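To make the storage scheme concrete, here is a minimal Python sketch of what one index record might look like. The post does not say how the semantic compression works, so `toy_fingerprint` is only a stand-in: it is a hashed bag of words, which is exactly the keyword-style approach the real engine avoids. The record layout, every name, and the Euclidean `match_cost` are assumptions for illustration.

```python
import hashlib
import math
from dataclasses import dataclass

DIMENSIONS = 150      # vector size mentioned in the post
SNIPPET_CHARS = 250   # snippet length mentioned in the post


@dataclass
class IndexRecord:
    """Hypothetical record: title, URL, snippet, and the fingerprint vector."""
    title: str
    url: str
    snippet: str
    fingerprint: list


def toy_fingerprint(text):
    """Stand-in for the engine's semantic compression (NOT the real method).

    Lower-cases the text, hashes each word into one of 150 buckets, and
    scales the result to unit length.
    """
    vec = [0.0] * DIMENSIONS
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % DIMENSIONS
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def make_record(title, url, text):
    """Build a record that keeps only a snippet, never the full text."""
    return IndexRecord(
        title=title.lower(),
        url=url,
        snippet=text[:SNIPPET_CHARS].lower(),
        fingerprint=toy_fingerprint(text),
    )


def match_cost(a, b):
    """Euclidean distance between two fingerprints (assumed cost metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Whatever the real compression does, the point carries over: comparing two records touches 150 numbers no matter how long the articles were, which would explain why lookups are fast while indexing is the slow step.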
When you run a new article into the engine for comparison, its domain fingerprint is taken and stored for future comparisons. You get to see what that fingerprint is, and then you get three tiers of results: Primary (full) matches, Secondary (partial) matches, and Tertiary (peripheral) matches. Primary matches will contain duplicates, articles about the same thing from different sources, and occasionally different articles about very similar things. For fun, you can press the back button, delete or change some text, submit it again, and see how much change it takes for your original submission to drop out of the Primary tier. (Don't change the URL, so it doesn't get inserted into the DB again.) Secondary matches contain closely related articles (although you may see a few of these in the Tertiary matches, especially the first few entries). The number in parentheses is the match cost, or how far away the given reference is from the one you submitted in terms of domain space.
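The three tiers can be pictured as cutoffs on the match cost. The sketch below is one possible reading in Python, not the engine's actual logic: the cutoff values, the Euclidean distance, and the function names are all hypothetical, since the post gives none of them.

```python
import math

# Hypothetical cutoffs; the post does not give the real values.
PRIMARY_MAX = 0.25
SECONDARY_MAX = 0.60
TERTIARY_MAX = 1.00


def match_cost(a, b):
    """Euclidean distance between two fingerprints (assumed cost metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tiered_matches(query, index):
    """Sort indexed (title, vector) pairs into three tiers by match cost.

    Anything farther away than TERTIARY_MAX is dropped entirely.
    """
    tiers = {"primary": [], "secondary": [], "tertiary": []}
    for title, vector in index:
        cost = match_cost(query, vector)
        if cost <= PRIMARY_MAX:
            tiers["primary"].append((title, cost))
        elif cost <= SECONDARY_MAX:
            tiers["secondary"].append((title, cost))
        elif cost <= TERTIARY_MAX:
            tiers["tertiary"].append((title, cost))
    # Within each tier, show the closest matches first.
    for hits in tiers.values():
        hits.sort(key=lambda hit: hit[1])
    return tiers
```

With fixed cutoffs like these, editing a resubmitted article nudges its vector away from the stored one, and it falls out of the Primary tier once the cost crosses the first cutoff.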
Tertiary matches let you go on a random walk through domain space. Sure, there's some definite relation to what you posted, but you will definitely be venturing afield in much of the linked article. If you want to see a really good example that works well against the current database, post the text from to the search engine.
You can try it by going here (or better yet, open a new browser window and point it at http://www.neurogy.com/sense/compare.html) so you can copy and paste articles from this side to see similar articles that are already in my database.
Since this is something completely different, please post requests for format and capability, and I'll see what I can do.
Jim and John, once you see this, feel free to use it forever - I didn't have enough cash to contribute in the last fundraiser :( You can look at my previous donation history to get contact information if you want. I can provide you with simple CGI interface calls to make this part of the posting process, to show posters potential duplicates before they post. If you want to index a significant portion of the recent articles, let me know and I'll make it easier.
Oops - forgot the good example URL. Try http://www.freerepublic.com/focus/f-news/1319996/posts
Bump & Ping
;O)
Thanks to you and every FReeper who helps in one way or another.
That's the idea - they could add a call to the indexer to check on zero (or low) match cost primary tier hits, and if any come up, they could add a "is this a duplicate" box.
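As a rough Python sketch of that check, assuming the indexer hands back Primary tier hits as (title, URL, match cost) tuples. The threshold value and both function names are made up for illustration; the real cutoff for "zero or low" would need tuning.

```python
DUPLICATE_MAX_COST = 0.05  # hypothetical cutoff for "zero (or low)" match cost


def possible_duplicates(primary_hits, max_cost=DUPLICATE_MAX_COST):
    """Return the Primary tier hits close enough to warrant a prompt.

    primary_hits is a list of (title, url, match_cost) tuples.
    """
    return [hit for hit in primary_hits if hit[2] <= max_cost]


def should_ask_is_duplicate(primary_hits):
    """True if the posting form should show an "is this a duplicate" box."""
    return bool(possible_duplicates(primary_hits))
```

The posting form would call `should_ask_is_duplicate` before accepting the article and list the returned hits next to the box.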
Awesome. While you're at it, add a date range filter. Also, can you set up a search engine for replies, as well-- using text from the reply or author (screenname) or date range or all or some of those (like they have at other message boards)?
Kudos to you sir!
Date range - no problem. The replies might be more problematic, as this is a sense-based engine and the entire article with replies gets shrunk down to 150 bytes or so (that's what makes it so stinking fast). Although, if John feels like integrating, I could extend it to do searches like "where did member X post about subject Y?"
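A date range filter could be as simple as the Python sketch below. It assumes each hit carries a posted date, which is not among the fields the original post lists as stored (title, URL, snippet), so the index would need one more column. The tuple layout and the function name are hypothetical.

```python
from datetime import date


def filter_by_date(hits, start=None, end=None):
    """Keep hits whose posted date falls within [start, end], inclusive.

    hits is a list of (title, posted_date, match_cost) tuples.
    Either bound may be None to leave that side open.
    """
    kept = []
    for title, posted, cost in hits:
        if start is not None and posted < start:
            continue
        if end is not None and posted > end:
            continue
        kept.append((title, posted, cost))
    return kept
```

For example, `filter_by_date(hits, end=date(2004, 12, 31))` would keep only matches posted before 2005. Because the filter runs on the already matched hits, it leaves the vector comparison itself untouched.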
Thanks for the vote of confidence, but this is an Alpha engine, and I bet we'll find a lot of holes in it before everyone is happy. That's OK - I needed something to do in the evenings anyway :)
Fascinating
Article Title: Canseco's lack of remorse for steroid use damages kids
Article URL: http://www.chron.com/cs/CDA/ssistory.mpl/features/3052472
Article Text: This week I tallied up the number of years I've spent on or near a baseball field. Nineteen. Yep, for almost two decades I've sat in the bleachers, by the dugout or on the sidelines to watch one of my five kids play the Great American Pastime. I figure that, with my youngest at 11, I probably have five or more good seasons left in me before I retire. Which makes me, if not an expert in the game, at least a knowledgeable observer. In other words, I can with some authority tell my son he's hitting the ball late. I can also counsel from personal experience that losing 13-zip is not the end of the world. Like most parents of sports-obsessed children, I've survived the wax and wane of major-league dreams. At some point in his Little League career, each of my four sons thought he would one day make it to the pros.
...and got this back (I wonder if you can trap this and provide a layman's explanation) ...
(That's all I got back. Are you checking for the minimum 1000 characters?)
CGI Error
The specified CGI application misbehaved by not returning a complete set of HTTP headers. The headers it did return are:
I expect to see /posts on the end of the URL, and right now it will freak out if you try to post anything that isn't on Free Republic (mainly because of a naming convention I used to cut some corners in the initial test). I can change that tonight so you can post from anywhere.
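One way to trap this instead of letting the script die is sketched below in Python. It assumes the current restriction is "a Free Republic host with a path ending in /posts", as described above, and it always returns a complete CGI response (headers, a blank line, then a body), so the web server has nothing to complain about. The function names and messages are hypothetical, and the actual engine may be written in something else entirely.

```python
from urllib.parse import urlparse


def is_accepted_url(url):
    """Mirror the restriction described above: FR URLs ending in /posts."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_fr = host == "freerepublic.com" or host.endswith(".freerepublic.com")
    return on_fr and parsed.path.endswith("/posts")


def cgi_response(body, status="200 OK"):
    """Build a complete CGI response: headers, blank line, then the body."""
    return (
        f"Status: {status}\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        "\r\n"
        f"{body}"
    )


def handle_submission(url):
    """Reject unsupported URLs with a readable message, not a crash."""
    if not is_accepted_url(url):
        return cgi_response(
            "Sorry, for now only Free Republic article URLs ending in "
            "/posts can be submitted.",
            status="400 Bad Request",
        )
    return cgi_response("Article accepted for comparison.")
```

Submitting the chron.com URL from the report above would then come back as a plain "Sorry" message rather than a CGI Error page.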
Wow. Will try this later. Thank you.
Suggestion: Put the 'Submit Article' button at the top of the big text box so that I don't have to scroll all the way to the bottom of the page just to click the button (unless I paste the article text first and save the title or URL 'til last, in which case I can just hit the Enter key to submit the form).
I followed you up to there. I can guess, but better not. What's that mean?
Also, the more text you post, the better your results will be (higher domain dimensionality).
If there were a way to search around dates, or between a date range, that would help.
That is a failing of most major search engines I've tried: no way to search by date. Most show the most recently dated articles first. But when researching, many times I want matches older than today, or last month, or even last year. To get to the older matches, one has to wade through mounds of results from the latest dates.