Written by Fernando Maciá
Why do we need to scrape websites? And how can we do it without knowing how to program? We look at several tools and go deeper into one in particular. A talk by Arturo Marimón, CEO & Consultant at Arturo Marimón Consultores.
Web scraping means searching for and collecting data that may be of interest to us.
At the SEO level, we use scraping for different things:
- To scrape products and URLs in preparation for a complex migration.
- To monitor price changes in competitors’ products.
- To monitor competitors’ new product launches.
- To continuously check for on-page SEO changes: titles, descriptions, body copy…
Detecting scrapers
First of all, respect the site you are scraping: too many requests can take down a website’s server, so throttle your crawl to avoid overloading it. Sysadmins don’t want scrapers on their site and will try to block you.
Scrapers are detected by a high volume of requests, repeated request patterns and so on, so we must hide our identity in order to extract the information.
That means changing the User-Agent and using proxies to disguise ourselves so that the scraping is not detected.
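For readers who do want to script this part, here is a minimal sketch of the idea using Python’s `requests` library; the User-Agent string, proxy address and URLs are placeholders, not real values:

```python
import time
import requests

# Placeholder values — swap in your own proxy pool and a realistic User-Agent string.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
PROXIES = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

urls = ["https://www.example.com/page-1", "https://www.example.com/page-2"]

for url in urls:
    # Send each request through the proxy with the spoofed User-Agent
    response = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests so we don't overload the server
```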
Tools for scraping
You need to know XPath and HTML. HTML should be familiar to everyone working in digital. XPath is a very simple language: it queries nodes and attributes in the document tree (the same kind of tree structure into which HTML, like XML, is organised).
Different XPath expressions let you locate and extract different elements from the HTML.
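As an illustration of what those XPath queries look like in practice, here is a small sketch using Python and the `lxml` library; the URL and the exact expressions are examples to adapt to the page you are auditing:

```python
import requests
from lxml import html

# Example URL — replace with the page you want to inspect
url = "https://www.example.com/"
tree = html.fromstring(requests.get(url, timeout=10).content)

# XPath queries over nodes and attributes of the parsed HTML tree
title = tree.xpath("//title/text()")
description = tree.xpath("//meta[@name='description']/@content")
headings = tree.xpath("//h1//text()")

print("Title:", title)
print("Description:", description)
print("H1s:", headings)
```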
Scraper is a Chrome plug-in. Once installed, we can right-click an element on a page and choose “Scrape similar”, or select any element in the DOM. It is very simple.
In addition to scraping pages, we can also scrape Google itself. We can run a site:domain.com search and, using the plug-in together with infinite scroll on the results, extract all the titles, URLs and so on for a website.
I was going to show KimonoLabs, but it has just shut down. Import.io, Apifier and ParseHub are other scraping tools. KimonoLabs let you schedule periodic scrapes and alert you when content changed, and these tools now allow you to do just that.
Zapier + Google Spreadsheet
We can use a Google spreadsheet to scrape a website. Create a new spreadsheet in Google Docs and name the columns for the elements to be scraped: for example, title, description and headings.
Google Sheets provides the importXML function, and with XPath we tell it which elements we want to import into each column. We feed the sheet with URLs, extracted for example by scraping Google results or from a Screaming Frog crawl.
From there, we drag each column’s formula down and it fills in all the rows with the data we are interested in.
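As a concrete example (the cell references and XPath expressions here are only illustrative), with a URL in cell A2 you could use `=IMPORTXML($A2, "//title")` for the title, `=IMPORTXML($A2, "//meta[@name='description']/@content")` for the meta description and `=IMPORTXML($A2, "//h1")` for the headings, then drag each formula down its column.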
The next step is to run this periodically. The spreadsheet can be set to recalculate at a regular interval, for example every minute. Then, if the data changes, we want to be notified. That’s where Zapier comes in.
Zapier automates tasks across different programs and applications, so we can set it up to notify us when something changes in our spreadsheet: first we configure it to detect the change, then we tell it to send us an e-mail, for example through Gmail.
The next step is to have it keep a history by date: we instruct it to create a Google spreadsheet file and log each change, storing the changed item and the date of the change. This way we don’t have to dig through all the e-mails; we keep a history of the changes to each website.
With this, it sends an e-mail and also saves the record in the Google spreadsheet. It can also be connected to Trello so that each change triggers a notification in the form of a card. The whole process is automated and we don’t have to be constantly watching.
History of changes on a site
Finally: visualping.io, for when we just need to know when the content of a site changes. With this tool we mark an area of content on a site, indicate what percentage of that content must change, and set up alerts.
With all this, we can keep a close eye on the competition.