Indexing Web Pages -- CSE331 -- Program # 5

DUE Tuesday, November 9, 2004, by MIDNIGHT


Submit the following in <your dropbox>/prog5

prog5.cpp -- your source code (+ other files if you use more)
Makefile -- to build your executable (prog5)

Yep, that's all that's needed.


This assignment is a slight modification of a problem that was given to the student programming teams at the ACM 1996 Southeast Regional Programming Contest in Orlando, FL.

Write a program to create an index of a small collection of World Wide Web pages. Each "page" is a text file in a special format called HTML (Hypertext Markup Language). The HTML format includes regular text and special HTML commands (tags), which are always enclosed in angle brackets. For example, the string <A HREF="layout.htm"> is an HTML link command meaning that the text following it should be highlighted, and that a user click on the highlighted text should cause the web browser to fetch and display the file layout.htm.

Your program should open and read the file in the local directory named index.htm and then follow all HTML links to other files referenced within index.htm. The program continues to follow links and read files until all linked web pages have been read.

As each web page is being read, your program has several tasks.

Assumptions

Some useful ideas and Requirements

How to find a word (this is similar to the concordance example program in your textbook)

At any point during the reading of the file you are either in_between, in_a_word, or in_a_tag. The reading of a file starts in_between.

If you are in_between and you read a letter then you start a word_string with the upper case version of that letter and change states to in_a_word. As long as you are in_a_word and continue to read letters, concatenate the lower case versions of those letters with the word_string and continue. If you are in_a_word and read anything other than a letter, put it back, insert the word_string and file_number into the map, and change states to in_between.
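The word states described above can be sketched roughly as follows. This is only an illustration, not the required solution: the names WordIndex, State and indexWords are made up for this example, the map is assumed to hold a set of file numbers per word, tags are ignored entirely, and flushing a word that is still open at end of file is an added assumption.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical index type: word -> set of file numbers containing it.
using WordIndex = std::map<std::string, std::set<int>>;

enum class State { InBetween, InAWord };

// Reads every word from `in` and records it under `fileNumber`.
// Tag handling is left out; only the two word states are shown.
void indexWords(std::istream& in, int fileNumber, WordIndex& index) {
    State state = State::InBetween;
    std::string word;
    char c;
    while (in.get(c)) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (state == State::InBetween) {
            if (std::isalpha(uc)) {
                // First letter of a word is stored in upper case.
                word = static_cast<char>(std::toupper(uc));
                state = State::InAWord;
            }
        } else if (std::isalpha(uc)) {
            // Remaining letters are stored in lower case.
            word += static_cast<char>(std::tolower(uc));
        } else {
            // Not a letter: put it back, store the word, leave the word state.
            in.putback(c);
            index[word].insert(fileNumber);
            state = State::InBetween;
        }
    }
    // Assumption: a word that runs up to end of file still counts.
    if (state == State::InAWord) index[word].insert(fileNumber);
}
```

With this sketch, the input "Don't" yields the two words Don and T, which matches the sample output further down.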

If you are in_between and read an opening angle bracket "<" then you are now in_a_tag. If you are in_a_tag you first need to verify that the next sequence of characters is "A HREF=". If it is, skip the opening quote, read the filename, skip the closing quote and the closing angle bracket ">", and change states back to in_between. Then check the filename vector for a match to the newly read filename and, if no match is found, add the new filename to the vector.

If you are in_a_tag and the first seven characters are not "A HREF=" then skip everything until you find the closing angle bracket, and change states back to in_between.
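One possible way to handle the in_a_tag state is sketched below. It takes a shortcut compared with the character-by-character description: it reads the whole tag up to the closing ">" and then inspects it. The function names readTag and addIfNew are invented for this example, and matching "A HREF=" only in upper case is an assumption based on the sample files.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Call after the opening '<' has been consumed. Reads through the closing
// '>' and returns the linked filename, or an empty string when the tag is
// not of the form A HREF="...".
std::string readTag(std::istream& in) {
    std::string tag;
    std::getline(in, tag, '>');  // the '>' itself is consumed and discarded
    const std::string prefix = "A HREF=\"";
    if (tag.compare(0, prefix.size(), prefix) != 0) return "";
    const std::string::size_type close = tag.find('"', prefix.size());
    if (close == std::string::npos) return "";
    return tag.substr(prefix.size(), close - prefix.size());
}

// Returns the file number of `name`, appending it to `files` first if it
// is not already there.
std::size_t addIfNew(std::vector<std::string>& files, const std::string& name) {
    const auto it = std::find(files.begin(), files.end(), name);
    if (it != files.end()) return static_cast<std::size_t>(it - files.begin());
    files.push_back(name);
    return files.size() - 1;
}
```

Because addIfNew never adds a name twice, self-referential and mutually referencing pages do not cause a file to be queued more than once.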

Useful C++

What about those files?

As long as the filename vector contains files that have not been read, the main program opens the next unread file, reads it (extracting words and filenames), closes it, and clears the input file stream (in case any errors occurred). It then goes on to open the next file, and so on until every file in the vector has been read.
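The loop over the filename vector might look something like the sketch below. The names crawl and Scanner are hypothetical; the scanner callback stands in for whatever code reads one file and may append newly found filenames to the vector. Indexing by position rather than by iterator is deliberate, since the vector can grow while it is being walked.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <istream>
#include <string>
#include <vector>

// Hypothetical callback: scans one open file and may append to `files`.
using Scanner =
    std::function<void(std::istream&, int, std::vector<std::string>&)>;

// Visits every entry of `files`, including entries added during the walk.
// Returns the number of files that could actually be opened.
int crawl(std::vector<std::string>& files, const Scanner& scan) {
    std::ifstream in;  // one stream object reused for every file
    int opened = 0;
    for (std::size_t next = 0; next < files.size(); ++next) {
        in.open(files[next]);
        if (in) {
            scan(in, static_cast<int>(next), files);
            ++opened;
        }
        in.close();
        in.clear();  // reset eof/fail flags before the stream is reused
    }
    return opened;
}
```

A call would start with a vector holding only "index.htm" and pass a scanner that extracts words and links from the stream it is given.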

Input

The initial HTML file you should read is named index.htm.

Output

Create a file named "wordlist.txt" that has the following format. This assumes you found n files, including index.htm, and that those files contained m distinct words in total.
LINE  1          : HTML Files Searched
LINE  2          : -------------------
Next n LINES     : file number followed by file name
Next LINE        : blank
Next LINE        : Words Found in File(s)
Next LINE        : ----------------------
Next m LINES     : word followed by comma-separated list of file 
                   numbers for files within which the word was found
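As a rough illustration of this layout, the sketch below writes the listing to any output stream. The name writeWordList and the WordIndex type are made up for the example, and the exact spacing (file number right-aligned in three columns, two spaces between a word and its file list) is inferred from the sample output rather than stated as a requirement.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <map>
#include <ostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical index type: word -> set of file numbers containing it.
using WordIndex = std::map<std::string, std::set<int>>;

// Writes the file list followed by the word list. std::map and std::set
// iterate in sorted order, so words and file numbers come out ascending.
void writeWordList(std::ostream& out,
                   const std::vector<std::string>& files,
                   const WordIndex& index) {
    out << "HTML Files Searched\n"
        << "-------------------\n";
    for (std::size_t i = 0; i < files.size(); ++i)
        out << std::setw(3) << i << " -- " << files[i] << '\n';
    out << '\n'
        << "Words Found in File(s)\n"
        << "----------------------\n";
    for (const auto& entry : index) {
        out << entry.first << "  ";
        const char* sep = "";  // no comma before the first number
        for (int n : entry.second) {
            out << sep << n;
            sep = ",";
        }
        out << '\n';
    }
}
```

In the real program the stream would be a std::ofstream opened on "wordlist.txt".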

Sample Inputs

index.htm
<HTML>
<HEAD>
<TITLE> Indexing Web Pages </TITLE>
<META NAME="Generator" CONTENT="EditPlus">
<META NAME="Author" CONTENT="John H Stewman">
</HEAD>

<BODY>
<P>Write a program to create an index of a small collection
of World Wide Web pages.  Each "page" is a text file in a
special format called HTML (Hypertext Markup Language).  The
HTML format includes regular text and special html commands,
which are always encloses in angle braces.  For example, the
string <A HREF="layout.htm">layout.htm</A> is an HTML command meaning that
the following text should be highlighted;  a user click on
the highlighted text would cause a browser to fetch and
display the file layout.htm.</P>

<H1>Following Links</H1>
<P>Don't forget that links can be <A HREF="index.htm">
self-referential</A>!</P>
</BODY>
</HTML>
layout.htm
<A bunch of gibberish and a word>
Note that there is no rule that the file needs to be legal
HTML (if you know the rules), or that words really be
wordseiwlaoieu;a.  <A HREF="index.htm">Watch out for mutual 
references!
</HTML>

Sample Output

wordlist.txt
HTML Files Searched
-------------------
  0 -- index.htm
  1 -- layout.htm

Words Found in File(s)
----------------------
A  0,1
Always  0
An  0
And  0
Angle  0
Are  0
Be  0,1
Braces  0
Browser  0
Called  0
Can  0
Cause  0
Click  0
Collection  0
Command  0
Commands  0
Create  0
Display  0
Don  0
Each  0
Encloses  0
Example  0
Fetch  0
File  0,1
Following  0
For  0,1
Forget  0
Format  0
Highlighted  0
Htm  0
Html  0,1
Hypertext  0
If  1
In  0
Includes  0
Index  0
Indexing  0
Is  0,1
Know  1
Language  0
Layout  0
Legal  1
Links  0
Markup  0
Meaning  0
Mutual  1
Needs  1
No  1
Note  1
Of  0
On  0
Or  1
Out  1
Page  0
Pages  0
Program  0
Really  1
References  1
Referential  0
Regular  0
Rule  1
Rules  1
Self  0
Should  0
Small  0
Special  0
String  0
T  0
Text  0
That  0,1
The  0,1
There  1
To  0,1
User  0
Watch  1
Web  0
Which  0
Wide  0
Words  1
Wordseiwlaoieu  1
World  0
Would  0
Write  0
You  1