Wednesday, July 6, 2016

Using Python to Access Web Data, Week 4: Scraping HTML with BeautifulSoup

The question-
The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>
You are to find all the <span> tags in the file and pull out the numbers from the tag and sum the numbers.
Look at the sample code provided. It shows how to find all of a certain kind of tag, loop through the tags and extract the various aspects of the tags.
...
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
   # Look at the parts of a tag
   print 'TAG:',tag
   print 'URL:',tag.get('href', None)
   print 'Contents:',tag.contents[0]
   print 'Attrs:',tag.attrs
You need to adjust this code to look for span tags and pull out the text content of the span tag, convert them to integers and add them up to complete the assignment.
Sample Execution
$ python solution.py 
Enter - http://python-data.dr-chuck.net/comments_42.html
Count 50
Sum 2...

My submission
# Python 2 (BeautifulSoup 3 style import)
import urllib
from BeautifulSoup import *
url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
sum = 0
soup = BeautifulSoup(html)
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
   # Convert the span's text to an integer and add it to the running total
   sum+=int(tag.contents[0])
print 'Count = ',len(tags)
print 'Sum = ', sum
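
A quick way to check the span-summing logic offline, on just the three sample rows quoted in the question, is sketched below. It is not part of the assignment. It swaps BeautifulSoup for the standard library's html.parser so it runs without any extra install, and the names SAMPLE and SpanSummer are made up for illustration.

```python
from html.parser import HTMLParser

# The three sample rows quoted in the question.
SAMPLE = """
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>
"""


class SpanSummer(HTMLParser):
    """Hypothetical helper: collects the integer text inside each <span>."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        # Only non-blank text that sits inside a <span> is counted.
        if self.in_span and data.strip():
            self.numbers.append(int(data))


parser = SpanSummer()
parser.feed(SAMPLE)
parser.close()
print('Count', len(parser.numbers))
print('Sum', sum(parser.numbers))
```

For these three rows it should print Count 3 and Sum 265 (90 + 88 + 87), which is an easy number to verify by hand before running the real script against the full page.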

2 comments:

  1. import urllib.request,urllib.parse,urllib.error
    from bs4 import BeautifulSoup
    import ssl


    url = input('Enter - ')
    ht = urllib.request.urlopen(url).read()
    sum = 0
    soup = BeautifulSoup(ht,"html.parser")
    tags = soup('span')
    for tag in tags:
        sum+=int(tag.contents[0])
    print ('Count = ',len(tags) )
    print ('Sum = ', sum)

  2. This comment has been removed by the author.
