Tuesday, August 14, 2012

Computer Science-Web Scraping

Web scraping, or web harvesting, is the process of extracting data from the web. One example is data extraction using spreadsheets, whether in Excel or Google Docs. A simple example here shows the importhtml function, which brings a table from Wikipedia into the spreadsheet.
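For readers who want to see roughly what a table import like this does, here is a minimal Python sketch using only the standard library. It is not how the spreadsheet function is implemented. The names TableParser, import_table and SAMPLE are my own, the sample HTML is made up rather than taken from Wikipedia, and nested tables are not handled. The table index counts from 1, as it does in the Google Docs function.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html, index):
    """Return the rows of the index-th table (hypothetical helper, 1-based)."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page, standing in for a real one.
SAMPLE = """
<html><body>
<table><tr><th>Name</th><th>Year</th></tr>
<tr><td>Alpha</td><td>1990</td></tr></table>
<table><tr><td>Other</td></tr></table>
</body></html>
"""

rows = import_table(SAMPLE, 1)
for row in rows:
    print(row)
```

To run this against a live page you would first download the HTML, for example with urllib.request, and pass the text to import_table.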

The next example shows the use of the importxml function on the same page. This function is a bit more powerful, as you can specify the portions you want through XPath syntax. Only a portion of the output is shown, but the methodology is as follows. Go to the website you want to scrape, right-click somewhere on the page, and select the option to view the page source. Then look for the tag that marks what you want to scrape; this can range from a href to tr. Scraping for a href often gives you useful information, as it lists all the links on the page. tr marks a table row, and this is what I used here. Instead of pulling one specific table from the page, importxml searching for tr tags lists the rows of every table on that page.
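The same two searches, one for a href and one for tr, can be sketched in Python with the standard library's ElementTree, which supports a limited subset of XPath. This is only an illustration under some assumptions: the SAMPLE markup is made up, and it has to be well-formed XML for ElementTree to parse it, which many real pages are not. ElementTree also cannot select an attribute directly in the path, so the href value is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup standing in for a real page.
SAMPLE = """<html><body>
<p>See <a href="/wiki/First">first</a> and <a href="/wiki/Second">second</a>.</p>
<table><tr><td>A</td><td>1</td></tr></table>
<table><tr><td>B</td><td>2</td></tr></table>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Every link on the page: find all <a> elements that carry an href attribute.
links = [a.get("href") for a in root.findall(".//a[@href]")]

# Every table row on the page, whichever table it belongs to.
rows = [
    [(cell.text or "").strip() for cell in tr]
    for tr in root.findall(".//tr")
]

print(links)
print(rows)
```

Note how the tr search returns rows from both tables in one list, which matches what happens in the spreadsheet when you search for tr instead of asking for a single table.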

