

Class Acme.Spider

java.lang.Object
   |
   +----Acme.Spider

public class Spider
extends Object
implements HtmlObserver, Enumeration
A web-robot class.

This is an Enumeration class that traverses the web starting at a given URL. It fetches HTML files and parses them for new URLs to look at. All files it encounters, HTML or otherwise, are returned by the nextElement() method as a URLConnection.

The traversal is breadth-first, and by default it is limited to files at or below the starting point - same protocol, hostname, and initial directory.
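The traversal order described above can be sketched without any network access. The class below is a hypothetical stand-in, not the Spider source: it walks an in-memory link table breadth-first and, like the default limit, only follows links that start with the base URL string. The names BreadthFirstSketch, crawl, and the links map are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in that shows the traversal order only; not Acme.Spider's code.
final class BreadthFirstSketch {

    // Visit pages breadth-first from start, following only links that begin with base.
    static List<String> crawl(String start, String base, Map<String, List<String>> links) {
        Queue<String> todo = new ArrayDeque<>();  // URLs waiting to be visited
        Set<String> done = new HashSet<>();       // URLs already queued or visited
        List<String> order = new ArrayList<>();   // visit order, returned to the caller

        todo.add(start);
        done.add(start);
        while (!todo.isEmpty()) {
            String url = todo.remove();
            order.add(url);
            for (String link : links.getOrDefault(url, List.of())) {
                // Stay at or below the starting point, and never queue a URL twice.
                if (link.startsWith(base) && done.add(link)) {
                    todo.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        String base = "http://some.site.com/whatever/";
        Map<String, List<String>> links = Map.of(
            base, List.of(base + "a.html", base + "b.html", "http://other.site.com/"),
            base + "a.html", List.of(base + "sub/c.html", "http://some.site.com/elsewhere/"));
        // The two off-tree links are never visited.
        for (String url : crawl(base, base, links)) {
            System.out.println(url);
        }
    }
}
```

Both direct children of the starting page are visited before anything one level further down, which is what distinguishes breadth-first from depth-first order.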

Because of the security restrictions on applets, this is currently only useful from applications.

Sample code:

 Enumeration spider = new Acme.Spider( "http://some.site.com/whatever/" );
 while ( spider.hasMoreElements() )
     {
     URLConnection conn = (URLConnection) spider.nextElement();
     // Then do whatever you like with conn:
     URL thisUrl = conn.getURL();
     String thisUrlStr = thisUrl.toExternalForm();
     String mimeType = conn.getContentType();
     long changed = conn.getLastModified();
     InputStream s = conn.getInputStream();
     // Etc. etc. etc., your code here.
     }
 
There are also three protected methods you can override in a subclass (doThisUrl, brokenLink, and reportError) to control things like the search limits and what gets done with broken links and errors.

Sample applications that use Acme.Spider:


See Also:
HtmlScanner, NoRobots

Variable Index

 o done
 o err
 o todo

Constructor Index

 o Spider()
Constructor with no size limits, and the default error stream.
 o Spider(int, int)
Constructor with size limits, and the default error stream.
 o Spider(int, int, PrintStream)
Constructor with size limits.
 o Spider(PrintStream)
Constructor with no size limits.
 o Spider(String)
Constructor with a single URL and no size limits, and the default error stream.
 o Spider(String, PrintStream)
Constructor with a single URL and no size limits.

Method Index

 o addObserver(HtmlObserver)
Add an extra observer to the scanners we make.
 o addUrl(String)
Add a URL to the to-do list.
 o brokenLink(String, String, String)
This method can be overridden by a subclass if you want to change the broken link policy.
 o doThisUrl(String, int, String)
This method can be overridden by a subclass if you want to change the search policy.
 o gotAHREF(String, URL, Object)
Acme.HtmlObserver callback.
 o gotAREAHREF(String, URL, Object)
Acme.HtmlObserver callback.
 o gotBASEHREF(String, URL, Object)
Acme.HtmlObserver callback.
 o gotBODYBACKGROUND(String, URL, Object)
Acme.HtmlObserver callback.
 o gotFRAMESRC(String, URL, Object)
Acme.HtmlObserver callback.
 o gotIMGSRC(String, URL, Object)
Acme.HtmlObserver callback.
 o gotLINKHREF(String, URL, Object)
Acme.HtmlObserver callback.
 o hasMoreElements()
Standard Enumeration method.
 o main(String[])
Test program.
 o nextElement()
Standard Enumeration method.
 o reportError(String, String, String)
This method can be overridden by a subclass if you want to change the error reporting policy.
 o setAuth(String)
Set the authorization cookie.

Variables

 o err
 protected PrintStream err
 o todo
 protected Queue todo
 o done
 protected Hashtable done

Constructors

 o Spider
 public Spider(PrintStream err)
Constructor with no size limits.

Parameters:
err - the error stream
 o Spider
 public Spider()
Constructor with no size limits, and the default error stream.

 o Spider
 public Spider(String urlStr,
               PrintStream err) throws MalformedURLException
Constructor with a single URL and no size limits.

Parameters:
urlStr - the URL to start off the enumeration
err - the error stream
 o Spider
 public Spider(String urlStr) throws MalformedURLException
Constructor with a single URL and no size limits, and the default error stream.

Parameters:
urlStr - the URL to start off the enumeration
 o Spider
 public Spider(int todoLimit,
               int doneLimit,
               PrintStream err)
Constructor with size limits. This version lets you specify limits on the todo queue and the done hash-table. If you are using Spider for a large, multi-site traversal, then you may need to set these limits to avoid running out of memory. Note that setting a todoLimit means the traversal may not be complete: some URLs can be skipped. Setting a doneLimit means it may re-visit some pages.

Guesses at good values for an unlimited traversal: a todoLimit of 200000 and a doneLimit of 20000. You want the doneLimit pretty small because the hash-table gets checked for every URL, so it will be mostly in memory; the todo queue, on the other hand, is only accessed at the front and back, and so will be mostly paged out.

Parameters:
todoLimit - maximum number of URLs to queue for examination
doneLimit - maximum number of URLs to remember having done already
err - the error stream
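The effect of the two limits can be illustrated with a small model. This is a sketch under stated assumptions, not the Spider implementation: the class name BoundedListsSketch and its methods are hypothetical, and the rule for trimming the done table (forget the oldest entry) is an assumption, since the documentation does not say how Spider does it.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashSet;

// Hypothetical model of a size-limited todo queue and done table.
final class BoundedListsSketch {
    private final int todoLimit;
    private final int doneLimit;
    private final ArrayDeque<String> todo = new ArrayDeque<>();
    private final LinkedHashSet<String> done = new LinkedHashSet<>();

    BoundedListsSketch(int todoLimit, int doneLimit) {
        if (todoLimit < 1 || doneLimit < 1) {
            throw new IllegalArgumentException("limits must be at least 1");
        }
        this.todoLimit = todoLimit;
        this.doneLimit = doneLimit;
    }

    // Try to queue a URL; returns false if it is remembered or the queue is full.
    boolean offer(String url) {
        if (done.contains(url)) {
            return false;              // still remembered, so not queued again
        }
        if (todo.size() >= todoLimit) {
            return false;              // queue full: this URL is skipped for good
        }
        remember(url);
        todo.add(url);
        return true;
    }

    // Take the next URL to visit, or null if the queue is empty.
    String next() {
        return todo.poll();
    }

    // Assumed trimming rule: drop the oldest remembered URL when the table is full.
    private void remember(String url) {
        if (done.size() >= doneLimit) {
            Iterator<String> oldest = done.iterator();
            oldest.next();
            oldest.remove();
        }
        done.add(url);
    }
}
```

With both limits set to 2, a third URL offered while the queue is full is dropped (an incomplete traversal), and a URL that has been pushed out of the done table is accepted a second time (a re-visit).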
 o Spider
 public Spider(int todoLimit,
               int doneLimit)
Constructor with size limits, and the default error stream.

Parameters:
todoLimit - maximum number of URLs to queue for examination
doneLimit - maximum number of URLs to remember having done already

Methods

 o addUrl
 public synchronized void addUrl(String urlStr) throws MalformedURLException
Add a URL to the to-do list.

 o setAuth
 public synchronized void setAuth(String auth_cookie)
Set the authorization cookie.

Syntax is userid:password.
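A userid:password string is the form used by HTTP Basic authentication. Assuming that is how the cookie is sent (the documentation does not say), the header value would be built roughly as below. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; assumes the cookie is used for HTTP Basic authentication.
final class AuthCookieSketch {

    // Turn "userid:password" into the value of an "Authorization" request header.
    static String basicHeaderValue(String authCookie) {
        if (authCookie.indexOf(':') < 0) {
            throw new IllegalArgumentException("expected userid:password");
        }
        byte[] raw = authCookie.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }
}
```

A value built this way could be attached to a connection with the standard URLConnection.setRequestProperty( "Authorization", value ) call before the connection is opened.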

 o addObserver
 public synchronized void addObserver(HtmlObserver observer)
Add an extra observer to the scanners we make. Multiple observers get called in the order they were added.

Alternatively, if you want to add a different observer to each scanner, you can cast the input stream to a scanner and call its addObserver method, like so:

 InputStream s = conn.getInputStream();
 // Guard the cast in case the stream is not a scanner (e.g. a non-HTML file).
 if ( s instanceof Acme.HtmlScanner )
     ((Acme.HtmlScanner) s).addObserver( this );
 

 o doThisUrl
 protected boolean doThisUrl(String thisUrlStr,
                             int depth,
                             String baseUrlStr)
This method can be overridden by a subclass if you want to change the search policy. The default version only does URLs that start with the same string as the base URL. An alternate version might instead go by the search depth.
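The two policies mentioned above can be compared side by side. To keep the sketch self-contained, PolicyBase is a hypothetical stand-in that only copies the doThisUrl signature and the documented default; a real subclass would extend Acme.Spider instead. It also assumes that depth counts link hops from the starting URL, which the documentation does not spell out.

```java
// Hypothetical stand-in for the one overridable method; not Acme.Spider itself.
class PolicyBase {
    // Documented default: only do URLs that start with the same string as the base URL.
    protected boolean doThisUrl(String thisUrlStr, int depth, String baseUrlStr) {
        return thisUrlStr.startsWith(baseUrlStr);
    }
}

// Alternate policy: ignore where the URL lives and go by search depth instead.
final class DepthLimitedPolicy extends PolicyBase {
    private final int maxDepth;

    DepthLimitedPolicy(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    @Override
    protected boolean doThisUrl(String thisUrlStr, int depth, String baseUrlStr) {
        // Assumes depth is the number of links followed from the starting URL.
        return depth <= maxDepth;
    }
}
```

The depth-based version will follow links onto other hosts, so it is usually worth combining with size limits on a large traversal.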

 o brokenLink
 protected void brokenLink(String fromUrlStr,
                           String toUrlStr,
                           String errmsg)
This method can be overridden by a subclass if you want to change the broken link policy. The default version reports the broken link on the error stream. An alternate version might attempt to send mail to the owner of the page with the broken link.

 o reportError
 protected void reportError(String fromUrlStr,
                            String toUrlStr,
                            String errmsg)
This method can be overridden by a subclass if you want to change the error reporting policy. The default version reports the error on the error stream. An alternate version might ignore the error.
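As a rough illustration of overriding brokenLink and reportError together, the sketch below collects broken links in a list and ignores other errors. ReportingBase is a hypothetical stand-in that mimics the two protected methods and the err stream; the message formats are invented for the example and are not Spider's actual output.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two reporting hooks; not Acme.Spider itself.
class ReportingBase {
    protected final PrintStream err;

    ReportingBase(PrintStream err) {
        this.err = err;
    }

    // Default policy as documented: report on the error stream (format is made up).
    protected void brokenLink(String fromUrlStr, String toUrlStr, String errmsg) {
        err.println("Broken link in " + fromUrlStr + " to " + toUrlStr + ": " + errmsg);
    }

    protected void reportError(String fromUrlStr, String toUrlStr, String errmsg) {
        err.println("Error in " + fromUrlStr + " at " + toUrlStr + ": " + errmsg);
    }
}

// Alternate policy: remember broken links for later, stay silent about other errors.
final class QuietCollector extends ReportingBase {
    final List<String> broken = new ArrayList<>();

    QuietCollector(PrintStream err) {
        super(err);
    }

    @Override
    protected void brokenLink(String fromUrlStr, String toUrlStr, String errmsg) {
        broken.add(fromUrlStr + " -> " + toUrlStr + " (" + errmsg + ")");
    }

    @Override
    protected void reportError(String fromUrlStr, String toUrlStr, String errmsg) {
        // Deliberately ignored.
    }
}
```

After the traversal finishes, the collected list could be written to a report or grouped by the page that contains each broken link.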

 o hasMoreElements
 public synchronized boolean hasMoreElements()
Standard Enumeration method.

 o nextElement
 public synchronized Object nextElement()
Standard Enumeration method.

 o gotAHREF
 public void gotAHREF(String urlStr,
                      URL contextUrl,
                      Object clientData)
Acme.HtmlObserver callback.

 o gotIMGSRC
 public void gotIMGSRC(String urlStr,
                       URL contextUrl,
                       Object clientData)
Acme.HtmlObserver callback.

 o gotFRAMESRC
 public void gotFRAMESRC(String urlStr,
                         URL contextUrl,
                         Object clientData)
Acme.HtmlObserver callback.

 o gotBASEHREF
 public void gotBASEHREF(String urlStr,
                         URL contextUrl,
                         Object clientData)
Acme.HtmlObserver callback.

 o gotAREAHREF
 public void gotAREAHREF(String urlStr,
                         URL contextUrl,
                         Object clientData)
Acme.HtmlObserver callback.

 o gotLINKHREF
 public void gotLINKHREF(String urlStr,
                         URL contextUrl,
                         Object clientData)
Acme.HtmlObserver callback.

 o gotBODYBACKGROUND
 public void gotBODYBACKGROUND(String urlStr,
                               URL contextUrl,
                               Object clientData)
Acme.HtmlObserver callback.

 o main
 public static void main(String args[])
Test program. Shows URLs, file sizes, etc. at the ACME Java site.



ACME Java  ACME Labs