Wednesday, July 6, 2016

Using Python to Access Web Data, Week 4: Following Links in HTML Using BeautifulSoup

The problem-
In this assignment you will write a Python program that expands on The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.
We provide two files for this assignment. One is a sample file where we give you the name for your testing, and the other is the actual data you need to process for the assignment.
  • Sample problem: Start at 
    Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
    Sequence of names: Fikret Montgomery Mhairade Butchi Anayah 
    Last name in sequence: Anayah
  • Actual problem: Start at: 
    Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
    Hint: The first character of the name of the last page that you will load is: L
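
The "find the link at a position, follow it, repeat" process can be tried offline before touching the real data. The following is only a sketch under stated assumptions: it is written for Python 3, it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs, and the PAGES dictionary, the AnchorCollector class, and the follow_links function with its fetch parameter are all made-up names for illustration, not part of the assignment.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._in_anchor = False  # True while inside an open <a> tag
        self._href = None        # href of the <a> currently open
        self._text = []          # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def follow_links(fetch, start_url, position, count):
    """Follow the link at 1-based `position`, `count` times.

    `fetch` is any function that maps a URL to an HTML string.
    Returns the list of names (link texts) that were followed.
    """
    names = []
    url = start_url
    for _ in range(count):
        parser = AnchorCollector()
        parser.feed(fetch(url))
        href, name = parser.links[position - 1]  # 1-based -> 0-based
        names.append(name)
        url = href                               # next page to load
    return names


# Hypothetical in-memory "site" standing in for the real data files.
PAGES = {
    "a.html": '<a href="b.html">Bea</a> <a href="c.html">Cal</a>',
    "b.html": '<a href="c.html">Cal</a> <a href="a.html">Ann</a>',
    "c.html": '<a href="a.html">Ann</a> <a href="b.html">Bea</a>',
}

names = follow_links(PAGES.__getitem__, "a.html", position=2, count=3)
print(names)      # ['Cal', 'Bea', 'Ann']
print(names[-1])  # Ann
```

Passing fetch in as a parameter is a design choice for testing: the same loop should work against real pages if fetch downloads the URL and returns its HTML as a string, assuming the links on those pages are absolute URLs.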

My submission- 

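 # Python 2 with BeautifulSoup 3
 import urllib
 from BeautifulSoup import *

 url = raw_input('Enter URL ')
 count = int(raw_input('Enter count: '))
 # Positions in the assignment are 1-based; list indexes are 0-based.
 pos = int(raw_input('Enter position: ')) - 1

 html = urllib.urlopen(url).read()
 seq = ''
 for i in range(count):
   soup = BeautifulSoup(html)
   tags = soup('a')  # all anchor tags on the current page
   # contents[0] is the first child of the tag, here the link text (the name).
   seq = seq + tags[pos].contents[0] + ' '
   # Follow the link at the chosen position and load the next page.
   html = urllib.urlopen(tags[pos].get('href', None)).read()
 print "Sequence of names- ", seq
 # tags still refers to the last page parsed, so this is the final name.
 print 'Last name in sequence -', tags[pos].contents[0]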
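
Two details of the submission are easy to trip over: the position is reduced by one before it is used as a list index, and the names are collected by appending each one plus a space to a string. Here is a small illustration in Python 3, using a plain list of made-up names (link_texts) as a stand-in for the anchor tags on a page. It is a sketch of the idea only, not the BeautifulSoup objects themselves.

```python
# Stand-in for the link texts of the anchor tags on one page (made-up names).
link_texts = ["Ann", "Bea", "Cal", "Dee"]

position = 3          # 1-based, as the assignment counts ("the first name is 1")
index = position - 1  # 0-based, as Python lists count
picked = link_texts[index]
print(picked)         # Cal

# Accumulating names the way the submission does: name + ' ' each time.
seq = ''
for name in ["Cal", "Bea", "Ann"]:
    seq = seq + name + ' '
print(repr(seq))      # 'Cal Bea Ann '

# An alternative that avoids the trailing space: collect, then join.
joined = ' '.join(["Cal", "Bea", "Ann"])
print(repr(joined))   # 'Cal Bea Ann'
```

The trailing space left by the first approach is harmless when the string is only printed, which is all the submission does with it.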


  1. seq = seq + tags[pos].contents[0]+' '
    What does this step do?

    What is contents[0]+' '?

    1. Could you please explain the algorithm? I seem to be having trouble understanding the algorithm. Thanks.

  2. Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
    Hint: The first character of the name of the last page that you will load is: Z
