Yep, that's all that's needed.
Write a program to create an index of a small collection of World Wide Web pages. Each "page" is a text file in a special format called HTML (Hypertext Markup Language). The HTML format includes regular text and special HTML commands (tags), which are always enclosed in angle brackets. For example, the string <A HREF="layout.htm"> is an HTML link command meaning that the following text should be highlighted, and that any user click on the highlighted text should cause the web browser to fetch and display the file layout.htm.
Your program should open and read the file in the local directory named index.htm and then follow all HTML links to other files referenced within index.htm. The program continues to follow links and read files until all linked web pages have been read.
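The "follow links until every page has been read" rule amounts to a worklist traversal in which the filename vector doubles as the queue. The following is a minimal sketch, assuming C++ (the handout does not name a language). The names crawl and LinkFinder are hypothetical, and the callback stands in for the real file reading and tag parsing.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback: given a filename, return the link targets found in it.
using LinkFinder = std::function<std::vector<std::string>(const std::string&)>;

// Visit every file reachable from `start` exactly once.
// The returned vector is the visit order; a file's index is its file number.
std::vector<std::string> crawl(const std::string& start, const LinkFinder& links_in) {
    std::vector<std::string> files{start};
    // `files` grows while we walk it, so use an index rather than iterators.
    for (std::size_t i = 0; i < files.size(); ++i) {
        const std::string current = files[i];  // copy: push_back may reallocate
        for (const std::string& target : links_in(current)) {
            // Only unseen filenames are appended, so cycles terminate.
            if (std::find(files.begin(), files.end(), target) == files.end()) {
                files.push_back(target);
            }
        }
    }
    return files;
}
```

Because a filename is appended only when it is not already in the vector, self-references and mutual references cannot cause a file to be read twice.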
As each web page is being read, your program has several tasks.
If you are in_between and you read a letter then you start a word_string with the upper case version of that letter and change states to in_a_word. As long as you are in_a_word and continue to read letters, concatenate the lower case versions of those letters with the word_string and continue. If you are in_a_word and read anything other than a letter, put it back, insert the word_string and file_number into the map, and change states to in_between.
If you are in_between and read an opening angle bracket "<" then you are now in_a_tag. If you are in_a_tag, first verify that the next sequence of characters is "A HREF=". If it is, skip the opening quote, read the filename, skip the closing quote and the closing angle bracket ">", and change states back to in_between. Then check the filename vector for a match to the newly read filename, and if no match is found, add the new filename to the vector.
If you are in_a_tag and the first seven characters are not "A HREF=" then skip everything until you find the closing angle bracket, and change states back to in_between.
LINE 1 : HTML Files Searched
LINE 2 : -------------------
Next n LINES : file number followed by file name
Next LINE : blank
Next LINE : Words Found in File(s)
Next LINE : ----------------------
Next m LINES : word followed by comma-separated list of file numbers for files within which the word was found
index.htm
<HTML> <HEAD> <TITLE> Indexing Web Pages </TITLE> <META NAME="Generator" CONTENT="EditPlus"> <META NAME="Author" CONTENT="John H Stewman"> </HEAD> <BODY> <P>Write a program to create an index of a small collection of World Wide Web pages. Each "page" is a text file in a special format called HTML (Hypertext Markup Language). The HTML format includes regular text and special html commands, which are always encloses in angle braces. For example, the string <A HREF="layout.htm">layout.htm</A> is an HTML command meaning that the following text should be highlighted; a user click on the highlighted text would cause a browser to fetch and display the file layout.htm.</P> <H1>Following Links</H1> <P>Don't forget that links can be <A HREF="index.htm"> self-referential</A>!</P> </BODY> </HTML>
layout.htm
<A bunch of gibberish and a word> Note that there is no rule that the file needs to be legal HTML (if you know the rules), or that words really be wordseiwlaoieu;a. <A HREF="index.htm">Watch out for mutual references! </HTML>
HTML Files Searched
-------------------
0 -- index.htm
1 -- layout.htm

Words Found in File(s)
----------------------
A 0,1
Always 0
An 0
And 0
Angle 0
Are 0
Be 0,1
Braces 0
Browser 0
Called 0
Can 0
Cause 0
Click 0
Collection 0
Command 0
Commands 0
Create 0
Display 0
Don 0
Each 0
Encloses 0
Example 0
Fetch 0
File 0,1
Following 0
For 0,1
Forget 0
Format 0
Highlighted 0
Htm 0
Html 0,1
Hypertext 0
If 1
In 0
Includes 0
Index 0
Indexing 0
Is 0,1
Know 1
Language 0
Layout 0
Legal 1
Links 0
Markup 0
Meaning 0
Mutual 1
Needs 1
No 1
Note 1
Of 0
On 0
Or 1
Out 1
Page 0
Pages 0
Program 0
Really 1
References 1
Referential 0
Regular 0
Rule 1
Rules 1
Self 0
Should 0
Small 0
Special 0
String 0
T 0
Text 0
That 0,1
The 0,1
There 1
To 0,1
User 0
Watch 1
Web 0
Which 0
Wide 0
Words 1
Wordseiwlaoieu 1
World 0
Would 0
Write 0
You 1