Everything I Learned about RSS, Page Parsing, and Newsfeed Syndication

by Mike Shea on 14 September 2005

Today I took down a section of my website that I've hosted for nearly six years: my Newsfeeds section. I did it for a variety of reasons, but one big one is that it used up about 200 megs of bandwidth a day. Now that I'm hosting Mobhunter on my own servers, I felt it was time to let go of this service and let people find other ways to run their newsfeeds.

I've learned a lot about newsfeeds in the last six years and I don't want to walk away without leaving behind some of this vast knowledge to better humanity. So here are the things I learned about newsfeeds:

Page Parsing Works Best

Nearly all of my newsfeed stuff came from a variety of scripts written to parse pages. The best method I found for parsing HTML from a page was a nice little Perl module called HTML::TokeParser. This very simple page parser can easily hunt down hyperlinks within a document and return the results in an array.
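A minimal sketch of that hyperlink hunt looks like this. The HTML sample here is made up for illustration, but the HTML::TokeParser calls are the real module API:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample page; in practice you'd read a fetched file instead.
my $html = '<p><a href="http://example.com/one">First</a> '
         . '<a href="http://example.com/two">Second</a></p>';

my $p = HTML::TokeParser->new(\$html);
my @links;
while (my $tag = $p->get_tag("a")) {
    my $href = $tag->[1]{href} or next;    # element 1 is the attribute hash
    my $text = $p->get_trimmed_text("/a"); # text up to the closing </a>
    push @links, [ $href, $text ];
}
print "$_->[0]\t$_->[1]\n" for @links;
```

Each entry in @links ends up as a [url, link text] pair, which is exactly the raw material a headline feed needs.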

How to Save a Remote Page Locally

There are a few methods to save pages locally. One key: always do this with an automated process like a cron job. Never fetch and parse pages as users hit your site. It seems easy to set up a PHP script that fetches a page, parses it, and returns the results as HTML or XML, until you start getting 10,000 hits a day. Now site owners are pissed at you, your ISP is pissed at you, and your CPUs hate you. Instead, fetch a page, parse it, and save the results locally. When a user hits you, serve them the static page instead of running a dynamic parse.

There are two good methods to fetch a page locally: Unix wget and Perl's getstore.

Wget looks like this:

wget http://fly.srk.fer.hr/

That will fetch that page and save it locally. Wget has a ton of features, including a full spider, so go google it to learn more.

Getstore looks like this:

#!/usr/bin/perl -w

use LWP::Simple;

# Fetch the page and save it to a local file.
getstore("http://mikeshea.net", "mikeshea.html");
exit;

Both of these methods are great when used as a cron job to save files locally. Remember, above all: do not fetch pages on page hits. It's evil.

How to Cache Newsfeeds

The best way to create an RSS script or a tiny HTML page with recent headlines is to write a Perl script that fetches the page, parses it, turns links and titles into either hrefs or RSS items, and saves the results locally. Then run a cron job to do this every 30 minutes to an hour.
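That pipeline can be sketched roughly like this. The channel wording, sample page, and URLs are invented for the sketch, not taken from the actual scripts behind this site:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Turn the hyperlinks in a chunk of HTML into a minimal RSS 2.0 feed.
# Real code should also escape &, <, and > in titles before emitting XML.
sub html_to_rss {
    my ($html, $source) = @_;
    my $p = HTML::TokeParser->new(\$html);
    my @items;
    while (my $tag = $p->get_tag("a")) {
        my $href = $tag->[1]{href} or next;
        my $text = $p->get_trimmed_text("/a");
        push @items, "<item><title>$text</title><link>$href</link></item>";
    }
    return qq{<?xml version="1.0"?>\n}
         . qq{<rss version="2.0"><channel>\n}
         . qq{<title>Cached headlines</title>\n}
         . qq{<link>$source</link>\n}
         . qq{<description>Headlines parsed from $source</description>\n}
         . join("\n", @items)
         . qq{\n</channel></rss>\n};
}

# Demo on a made-up page. A real cron job would fetch live HTML with
# LWP::Simple's get() and write the result to a file under the web root.
my $sample = '<a href="http://example.com/story">A Sample Story</a>';
print html_to_rss($sample, "http://example.com/");
```

Wire it up with a crontab entry along the lines of `*/30 * * * * /home/mike/bin/feed_cache.pl` (path hypothetical), and every visitor gets the cached static feed instead of triggering a fetch.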

I've tried dozens of different ways to do newsfeed and headline syndication, and the above method is the best I've seen. Just remember: don't get involved with newsfeeds unless you are prepared for a lot of hits. I used to serve 200 megs of newsfeed downloads each day.

Here is a text listing of the scripts I used to make those newsfeeds: rss2html.pl, feed_cache.pl, pagelist.txt, ezboard2rss.pl, ripper2rss.php.

Good luck!