sproutworks
September 25th, 2006 5:23 AM PST
My blog search engine, SproutSearch, is now indexing over 8 million blogs. I am working on changing the way the blogs are ranked. For now, they are sorted by the sheer amount of content they contain. The big problem with this method is that many spam blogs contain masses of content, so they end up ranked highly. I don't like SproutSearch linking to so much spam, so I need to find a way to remove a lot of these listings.

It is not practical for me to read 8 million blogs, so I need an automated method to detect spam. Many spam blogs use the same words over and over, so I wrote a program to count the number of repeated words. Most spam blogs also seem to use a similar number of words per post, so I made another program that computes the standard deviation of the number of words per post; a low value means the post lengths are suspiciously uniform. Using these metrics, I will make a program that flags potential spam so I can review and delete it.
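
The two metrics described above could be sketched roughly like this in Python. This is not the actual SproutSearch code: the function names, the simple regex tokenizer, the thresholds, and the choice to require both signals before flagging are all assumptions made for illustration.

```python
import re
import statistics


def tokenize(text):
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def repeated_word_ratio(posts):
    # Fraction of all words in the blog that repeat an earlier word.
    # 0.0 means every word is unique; values near 1.0 mean heavy repetition.
    words = [w for post in posts for w in tokenize(post)]
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def post_length_stdev(posts):
    # Standard deviation of the number of words per post.
    # Returns None when there are too few posts to say anything.
    lengths = [len(tokenize(post)) for post in posts]
    if len(lengths) < 2:
        return None
    return statistics.pstdev(lengths)


def looks_like_spam(posts, min_repeat=0.6, max_stdev=5.0):
    # Hypothetical thresholds; real values would need tuning on known spam.
    # A flagged blog is only a candidate for manual review, not a verdict.
    stdev = post_length_stdev(posts)
    if stdev is None:
        return False
    return repeated_word_ratio(posts) >= min_repeat and stdev <= max_stdev


spam_posts = ["buy cheap pills now buy cheap pills now"] * 5
normal_posts = [
    "I went hiking today and saw two deer near the river",
    "Short update",
    "My search engine indexes blogs by crawling links it finds "
    "in other blogs and storing the text",
]

print(looks_like_spam(spam_posts))
print(looks_like_spam(normal_posts))
```

In this toy data the spam posts repeat four words and all have the same length, so both signals fire, while the normal posts vary in length and rarely repeat words. Requiring both signals is a conservative choice meant to keep the review queue small; flagging on either one alone would catch more spam at the cost of more false positives.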