Social Web Scraper
Thanks to my colleague Zhanna Khaymedinova for the idea of this exercise!
Nowadays there is a ton of information on the internet, and it is no wonder that such information, even if targeted at humans, is often collected and processed by robots.
In this task you are to write a small program which collects data from a social network. Start from here:
You will see that each page represents a person with a name, date of birth and
net worth. Each page also provides
links to a few other people somehow related to the given one, so that from John Doe you can navigate to Dan Wagner
(via "Friends") and from there to Dave Johnson (via messages on the "Wall").
The goal is to sum up the
net worth figures of all persons with a specific last name (e.g. Johnson) who are reachable
(via any number of links) from John Doe.
View the source of the page (by pressing
Ctrl-U, using the Inspect Element feature in Google Chrome, or the Firebug plugin
for Firefox) to see how elements of the text can be distinguished (with regexps or some other method).
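As a starting point, here is a minimal sketch of distinguishing one field with a regexp. The markup below is purely hypothetical - inspect the real page source and adjust the pattern to match what you actually see:

```python
import re

# Hypothetical markup - the real pages may differ, so view the actual
# source (Ctrl-U) and adapt the pattern accordingly.
SAMPLE_HTML = '<h1>John Doe</h1><p>Net worth: $50000</p>'

def extract_net_worth(html):
    """Pull the net worth figure out of a page, assuming a
    'Net worth: $NNN' pattern; returns 0 if nothing matches."""
    match = re.search(r'Net worth:\s*\$(\d+)', html)
    return int(match.group(1)) if match else 0

print(extract_net_worth(SAMPLE_HTML))  # -> 50000
```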
A typical approach is the following:
- fetch each page by URL into a string (for example, here are hints for Python);
- extract the data and the links (links look like <a href="./monika-s-smith.html"> - easy to fetch with a regexp!).
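The two steps above can be sketched like this; the link pattern assumes relative `./name.html` hrefs as in the example above, so adjust it if the real pages differ:

```python
import re
import urllib.request

def fetch(url):
    """Download a page and decode it into a string."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')

# Assumes links of the form <a href="./monika-s-smith.html">
LINK_RE = re.compile(r'<a href="(\./[-\w]+\.html)"')

def extract_links(html):
    """Return all relative person links found on a page."""
    return LINK_RE.findall(html)

print(extract_links('<a href="./monika-s-smith.html">Monika</a>'))
# -> ['./monika-s-smith.html']
```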
There are also some things to note:
- if your program has found over 1500 persons and still goes on, you are obviously in a loop of some kind, since there aren't that many pages;
- a small pause (about 200-300 milliseconds) after fetching each page may, surprisingly, reduce delays introduced by the site itself and speed up the process.
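Both notes suggest structuring the crawl around a visited set, a safety cap, and a short pause. A sketch, with the fetching and link-extraction functions passed in so the traversal itself stays testable:

```python
import time

def crawl(start, fetch, extract_links, limit=1500, pause=0.25):
    """Breadth-first walk over the pages. The visited set prevents
    revisiting (and thus looping); the limit is a safety cap, since
    there are nowhere near 1500 pages on the site."""
    visited = set()
    queue = [start]
    pages = {}
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        if len(visited) >= limit:
            raise RuntimeError('too many pages - probably stuck in a loop')
        visited.add(url)
        html = fetch(url)
        pages[url] = html
        time.sleep(pause)  # small pause after each fetch; may actually speed things up
        for link in extract_links(html):
            if link not in visited:
                queue.append(link)
    return pages
```

Because `fetch` and `extract_links` are parameters, the loop can be exercised against an in-memory dictionary of fake pages before pointing it at the real site.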
The input data will contain the last name (lowercased) we are interested in.
The answer should contain the total
net worth of all people with that last name who are reachable from the initial page.
Example:
input data: doe
answer: 130000
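Once the crawl has collected every person's name and net worth, the final step is a filtered sum. The figures below are made up purely for illustration (chosen so the Doe total matches the example answer); the real ones come from the crawled pages:

```python
def total_net_worth(people, last_name):
    """Sum net worth for everyone whose last name matches the
    (lowercased) query. `people` maps full names to net-worth figures."""
    return sum(worth for name, worth in people.items()
               if name.split()[-1].lower() == last_name)

# Hypothetical data for illustration only.
people = {'John Doe': 50000, 'Jane Doe': 80000, 'Dan Wagner': 70000}
print(total_net_worth(people, 'doe'))  # -> 130000
```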