Jack Krupansky on Blogging: Is the distinction between a blog search engine and a traditional search engine growing meaningless?

In a post on his Blogspotting blog entitled "Remove "blogs" from the headline?", Senior Writer Stephen Baker of BusinessWeek opines that "the distinction between a blog search engine and a traditional one is growing meaningless" (actually, I think he is attributing that to Steve Rubel). Superficially, I think that is absolutely correct, but there are some nuances.

What exactly is a "blog search engine"? Presumably it is something like Technorati, Bloglines Search, Google Blog Search, or Microsoft Windows Live Search Feeds, among others. But do any of these so-called "blog search engines" really search through all blog posts? Some of them may, but in general they index and search only the content of the feed (e.g., the RSS feed) associated with a blog.

Although so-called "blog search" (feed search) is indeed quite useful, especially when looking for timely content, there are several difficulties when contrasted with generic Web search.

First, the popular usage is that RSS stands for Really Simple Syndication. That is all well and good for popular, non-technical consumption, but it obscures what is really going on. The hard-core, technical meaning of RSS is RDF Site Summary (where RDF stands for Resource Description Framework). The semi-technical meaning of RSS is Rich Site Summary. The operative word in either of these two proper meanings is summary, indicating that the feed contains only a subset of the content of a site (such as a blog).

The issue of summary comes in three forms. First, only a subset of the content items may be present in the feed. Some feeds have a specific numeric limit, such as 10 or 20 items. Some feeds are time limited, such as all content within the past day, week, two weeks, month, 90 days, etc. In short, a blog search engine typically does not index all of the posts that exist on a blog and in its archives.
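To make the first form of summary concrete, here is a minimal sketch of how a count limit or an age limit shrinks what a feed-only index can see. The archive, the posting frequency, and the feed_window helper are all hypothetical, made up purely for illustration; real blog platforms apply their own rules.

```python
from datetime import date, timedelta


def feed_window(posts, today, max_items=None, max_age_days=None):
    """Return the (title, published) posts a feed would expose, newest first.

    Hypothetical helper: applies an optional age limit, then an optional
    item-count limit, the two kinds of limits described above.
    """
    newest_first = sorted(posts, key=lambda p: p[1], reverse=True)
    if max_age_days is not None:
        cutoff = today - timedelta(days=max_age_days)
        newest_first = [p for p in newest_first if p[1] >= cutoff]
    if max_items is not None:
        newest_first = newest_first[:max_items]
    return newest_first


today = date(2008, 1, 26)
# Assumed archive: 50 posts, one every three days, newest published today.
archive = [(f"post-{i}", today - timedelta(days=3 * i)) for i in range(50)]

by_count = feed_window(archive, today, max_items=10)
by_age = feed_window(archive, today, max_age_days=14)

print(f"archive: {len(archive)} posts")
print(f"feed limited to 10 items: {len(by_count)} posts visible")
print(f"feed limited to 14 days: {len(by_age)} posts visible")
```

Under these assumptions, a search engine that reads only the feed sees 10 of the 50 posts with the count limit and 5 of the 50 with the two-week limit, unless it has been archiving the feed since the blog began.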

A second form of summary is for the feed to have an abstract summary of the main content or an extended version of the headline. This is very common for traditional news stories or feature-length articles where the feed is designed for browsing, in contrast with reading the full content.

A third form of summary, used by some blogs, is a short intro to the post followed by a "More" link that you must click to read the full post. The author of such a blog then usually has the choice of putting only the summary in the feed or putting the full post in the feed. There are great debates in the blogosphere on the merits of those two approaches. The author may opt to include only the summary in the feed as a "teaser" to help drive traffic to the main site. On the positive side, short summaries greatly facilitate rapid browsing of large numbers of posts in feed readers.

The issue for all forms of summary is that a blog or feed search engine that looks only at the feed will not have the full blog post text to index and search.

OTOH, a feed has a very structured format for tags and other blog metadata that can facilitate more refined indexing and searching of blog posts, at least for the blog posts that are included in the feed.
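As a rough illustration of that structure, the sketch below pulls the title, permalink, date, tags, and text out of a tiny RSS 2.0 feed using only the Python standard library. The sample feed and the parse_feed_items helper are invented for this example; real feeds vary (Atom, RSS 1.0, namespaced extensions) and would need more handling than this.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented sample feed; the <description> holds only a teaser, not the full post.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/2008/01/first-post.html</link>
      <pubDate>Sat, 26 Jan 2008 10:00:00 GMT</pubDate>
      <category>search</category>
      <category>blogging</category>
      <description>A short teaser for the post.</description>
    </item>
  </channel>
</rss>"""


def parse_feed_items(feed_xml):
    """Extract the structured fields of each <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 style that email.utils understands.
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
            "tags": [c.text for c in item.findall("category")],
            "text": item.findtext("description", default=""),
        })
    return items


items = parse_feed_items(SAMPLE_FEED)
for entry in items:
    print(entry["title"], entry["published"], entry["tags"])
```

Every field arrives already labeled, which is what makes tag and date search easy for a feed-based engine. The "text" field, though, is only whatever the author chose to put in the feed.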

In general, for each blog post there will also be a distinct permalink web page which has all of the text of the blog post in a form that has its own URL address and can be fully indexed and searched by a traditional, generic Web search engine. The tag information for a post is also there, but not always in a uniform, structured format that the non-blog search engine can index and search anywhere near as well as a pure feed-based search engine.

One important distinction between a blog search engine and a traditional, generic Web search engine is that the latter tends to order search results strictly by so-called relevance, while the former frequently also lets you order the results by date (reverse chronological, as with a traditional blog) or by a mixture of relevance and date. That makes it easy to get at timely content, even when the top authoritative content may be too old to even show up in a typical web feed.
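One way to picture a "mixture of relevance and date" is a weighted blend of a relevance score and a freshness score that decays with age. This is only a sketch under my own assumptions: the blended_score function, the 30-day half-life, and the sample results are hypothetical, and I have no idea how any actual search engine combines the two.

```python
def blended_score(relevance, age_days, date_weight, half_life_days=30.0):
    """Blend relevance (0..1) with a freshness score that halves every half_life_days.

    date_weight = 0.0 ranks purely by relevance; 1.0 ranks purely by date.
    """
    freshness = 0.5 ** (age_days / half_life_days)
    return (1.0 - date_weight) * relevance + date_weight * freshness


def rank(results, date_weight):
    """Return result names ordered from best to worst blended score."""
    ordered = sorted(
        results,
        key=lambda r: blended_score(r["relevance"], r["age_days"], date_weight),
        reverse=True,
    )
    return [r["name"] for r in ordered]


# Made-up results: an old authoritative page and two more recent blog posts.
results = [
    {"name": "old-authoritative", "relevance": 0.9, "age_days": 400},
    {"name": "week-old", "relevance": 0.7, "age_days": 7},
    {"name": "fresh-post", "relevance": 0.6, "age_days": 2},
]

by_relevance = rank(results, date_weight=0.0)
by_date = rank(results, date_weight=1.0)
print("relevance only:", by_relevance)
print("date only:", by_date)
```

With the weight at zero the old authoritative page comes first; with the weight at one the freshest post does. Anything in between is a dial that the searcher could, in principle, be allowed to turn.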

Technically, there is no good reason that a Web search engine could not also permit these options, but today they (in general) do not. Actually, there are some technical difficulties with date-ordering the general Web, but I suspect that most of them could be worked around (to some extent) if the search engine teams put their hearts into it.

What I would really like to see is for each of the main Web search engines to offer options for what mix of blog and non-blog content to display, along with a reverse date weighting. Even with the general Web, I frequently want to search for fresh content simply because I have already seen all the old stuff and want to see the new stuff, or stuff that I haven't marked as read. That alone is a great argument for blog search engines to continue to exist, but if the big search engines did a slightly better job of integrating blog search, the need for separate blog search engines would vanish.

In summary, for now I still find blog search engines (marginally) useful, but their days could be numbered, especially if they do not try to break out of their niche and start adding value beyond simply indexing and searching the (limited) content of blog feeds.

-- Jack Krupansky

Saturday, January 26, 2008
