Friday, August 17, 2012

Build Your Own URL Scraper and Scale Your CPV Campaigns Sky High

This is an extremely well done guest post by Tijn of IMRat.com (Internet Marketing Rat). While not for the faint of heart, the technical detail here is extremely valuable if you’re willing to take the time to digest it all, even if that means coming back later to read it in chunks. So without further ado, check it out…
Don’t you hate it when you spend days testing URLs for your new CPV or Google Display Network campaign, only to find that the profitable URLs give you just enough traffic and profit to have a couple of pints (as in beer)?
$20 a day in profit is nice, but it won’t last long. Especially if you only have 2 or 3 profitable targets. You are destined to fail very quickly. As soon as another affiliate targets the same URLs you are doomed.
Some will urge you to move on and find another offer.
Not me.
Save that $100 you would have spent testing 10 new offers. Focus first on scaling and maximising the profits from the offer that you already know converts!
All you need to do is find a bunch of related URLs and test those.
But how? You don’t really want to spend hours checking a variety of websites, copying and pasting URLs. You could outsource this boring, brain-numbing task…
Nah! Follow my motto…

Outsource Only What Can’t Be Automated!

So in this post I will share with you one of the many ways in which I automate my CPV workflow.
You will learn how you can build your own automated URL scraper. The methods I share will save you hours of time! No more manual copying and pasting of results from the search engines.

The Tools

There are a number of websites that you will need to use to build your scraper. I suggest you register if you don’t have an account:
Yahoo Pipes
Dapper

The 8 Sources Of Related URL Data

Using the tools mentioned above, you will set up your scraper to get related URL data. Here are 8 sources I use to get related URLs and keywords to scale my campaigns.
Google – Similar Sites
Google Trends – Sites Also Visited
Alexa – Top 5 High Impact Search Queries, Related Links, and Clickstream
iSpionage – SEO Competitors
Google Ad Planner – Sites Also Visited
Compete – Destinations
Compete – Top 5 Referrer Keywords
Open Site Explorer – High authority backlinks
And yes – I use these, as well as a bunch of other hush hush techniques in my own private scraper. I am not gonna walk you through scraping each of these, that would be too easy.
I’ll show you how to do 3. You can figure out the others for yourself. After all, the aim here is for you to learn something….right? ;)

How will the URL scraper work?

Using free online applications, tied together with some secret sauce, the tool will take a profitable URL as input, and spit out a bunch of other URLs for you to test in your CPV or Google Display Network campaign.
For this tutorial, the URLs will come from 3 sources:
● Google Related Sites
● Google Trends
● High Page Rank Backlink URLs
Then, all you need to do is add the new URL targets to your campaign, and see which ones convert. Take the URLs that are converting, and rinse and repeat.
So let’s get started with Stage 1, setting up Yahoo Pipes.

Stage 1: Create Your Master Yahoo Pipe

Yahoo Pipes is the control centre for your scraper. The reason for this is that it’s got loads of modules for you to manipulate, clean and extract data.
At the moment Yahoo is still using the V1 engine as the default, and this tutorial is focused on that version. They are working on version 2, and you can test your pipes in the V2 engine if you want, but it is still in beta.
Check out one of my earlier posts about scraping google hot trends with yahoo pipes for an introduction.
Once you have checked out the post above, go over to Yahoo Pipes, register/login, and click the Create A Pipe link.
This will take you to the main editor screen. Before you start editing, name your pipe: click the ‘Untitled’ tab in the top right hand corner, enter a name for your pipe, and click OK. Then click the Save button on the top right of the screen.

If you want, you can also use the ‘Properties’ button to enter the title, a description and a bunch of tags to help you locate your pipes more easily in future.

Stage 2: Add the URL Input Module

Next, you are going to create the module for inputting the URL you want to expand.
For this you just drag the User inputs > URL Input module into the Pipe Editor. Enter the name and prompt, and make sure you enter 1 in the Position field. When building pipes I also add default and debug info to help with testing modules.

For this tutorial I focus on a scraper that is primarily targeted at expanding domains, not URLs for specific pages. Don’t worry though, you can use most of the techniques I share on page urls as well.
Therefore as a next step, you want to make sure that the domain is extracted from whatever URL is input. This is done with a simple String > String Regex module which takes the URL string and uses a regex code to extract the domain.
Make sure that the output from the URL input module is connected to the input of this new module, and enter the fields as they are shown below:
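If you’d rather prototype this step outside Yahoo Pipes first, here is a rough Python sketch of the same idea: take whatever URL is input and keep only the domain. The pattern below is my own illustration, not the exact rule from the pipe, and it assumes you also want a leading www. dropped.

```python
import re

# Hypothetical pattern (an assumption, not the pipe's actual rule):
# optional scheme, optional "www.", then the host up to the first "/", ":", "?" or "#".
DOMAIN_RE = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?([^/:?#]+)",
    re.IGNORECASE,
)


def extract_domain(url: str) -> str:
    """Return the bare domain of a URL, or the input unchanged if nothing matches."""
    match = DOMAIN_RE.match(url.strip())
    return match.group(1).lower() if match else url


print(extract_domain("http://www.jonathanvolk.com/blog/some-post.html"))
```

If you want to keep www. in the output, remove the `(?:www\.)?` part of the pattern.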

Stage 3: Get Google Similar Sites

For this first bunch of URLs you will use a feature from google which allows you to identify sites that are similar to a particular URL.

Unfortunately, because Google uses its robots.txt file to block some scrapers, you won’t be able to use Yahoo Pipes to fetch these results directly.
No problem though, because the excellent Dapper comes to the rescue. For an introduction to Dapper, watch this video. Dapper is basically a simplified version of the Fetch Page module in Yahoo Pipes.

Setup your Dapper

If you are not familiar with how to use Dapper to scrape data from Google, I suggest you read my blogpost about Scraping Google Instant.
Go over to open.dapper.net, and create your google dapper, and set it up for the following search query:
related:jonathanvolk.com
You can, if you want to return more than 10 results per query, add the following to the google url:
&num=100
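For reference, this is roughly how that search URL is put together. It is a minimal Python sketch that assumes the standard https://www.google.com/search endpoint (the post doesn’t spell out the full URL), with `related_query_url` being a helper name I made up for the example.

```python
from urllib.parse import urlencode


def related_query_url(domain: str, num: int = 100) -> str:
    """Build a Google search URL for the related: operator.

    The endpoint is an assumption; q and num are the query parameters
    described in the text above.
    """
    params = {"q": f"related:{domain}", "num": num}
    return "https://www.google.com/search?" + urlencode(params)


print(related_query_url("jonathanvolk.com"))
```

Note that `urlencode` escapes the colon in `related:` as `%3A`, which is what a browser does too.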
Once you’re done with your Dapp and have it set up, you need to get the Dapp URL. Just click the XML button:

And enter a dummy URL in the URL field (in the image below I entered http://jonathanvolk.com):

Click the Update Input button and copy the URL starting with http://open.dapper.net/RunDapp… into your notebook because you will need that URL in the next step.

Build the URL in Yahoo Pipes

Switch back to your Yahoo Pipes browser window and drag a URL > URL Builder module into the editor. Then paste the Dapper XML URL from the previous step into the field next to the label ‘Base:’ and press the Tab key. This should split the URL and populate the Query Parameters as shown in the screenshot below.
Connect the output from the String Regex module in Stage 2 with the input to the right of the variable input field for v_url as shown above. This will basically add the domain to the v_url variable and pass this to the Dapp you created just now.

To make sure that the URL is correct, click on the title bar so that the module box is highlighted in orange. Then click on the refresh link in the bottom left hand corner (the debug part of the screen with the gray background).

If everything has been connected and setup correctly you should see the URL as per the above screenshot.
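What the URL Builder does here can be sketched in a few lines of Python: split the URL into a base and its query parameters, swap in a new value for v_url, and glue it back together. The template URL and the `name` parameter below are placeholders I invented; substitute the RunDapp URL you copied from Dapper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_url(template: str, **overrides: str) -> str:
    """Split a URL into base + query parameters, override some, and rebuild it."""
    parts = urlsplit(template)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder template, not a real Dapper URL.
template = "http://example.com/run?name=MyDapp&v_url=placeholder"
print(build_url(template, v_url="jonathanvolk.com"))
```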

Load the XML Dapper Data

Now that you have the URL to your Dapp set up, all you need to do is set up a Fetch Data module, which will load the XML data from the URL.
So drag a Sources > Fetch Data module onto the Pipe Editor and connect the output from the URL Builder module in the previous step to the URL field input of this new module as shown below:

In the Path to item list you want to enter url, or whatever other name you assigned to the Dapp variable.
Again, to test, make sure the module is selected and click the refresh link at the bottom of the page. This should output a list of items numbered 0 to 7, and if you expand an item, it should look like the following:

Because this contains an amount of data you are not interested in, you just need to add the Operators > Sub-element module to strip out any data you don’t need.
Link this with the output from the previous module and you are done and can move on to the next stage.
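The Fetch Data plus Sub-element combination boils down to: load the XML, walk the repeating items, and keep one field from each. Here is a minimal Python sketch of that. The sample XML and the `href` field name are made up for illustration; your Dapp’s output will use whatever names you assigned in Dapper.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the Dapp's XML output.
SAMPLE_XML = """
<results>
  <url><href>http://example.org/</href><title>Example Org</title></url>
  <url><href>http://example.net/</href><title>Example Net</title></url>
</results>
"""


def extract_sub_elements(xml_text: str, item_path: str, field: str) -> list[str]:
    """Return the text of `field` for every `item_path` element in the XML."""
    root = ET.fromstring(xml_text.strip())
    values = []
    for item in root.iter(item_path):
        child = item.find(field)
        if child is not None and child.text:
            values.append(child.text.strip())
    return values


print(extract_sub_elements(SAMPLE_XML, "url", "href"))
```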

Stage 4: Google Website Trends

Next up, the 2nd source of related URLs: Google Website Trends site & the list of “Also visited” sites.

Here you are just going to use the Also visited site URLs. If you’re looking for ways to expand your scraper, you could use the list of ‘also searched for’ keywords and plug them into a Google SERP scraper.
To scrape these URLs you can use Yahoo Pipes. For some reason this is not blocked by Google.
As with the previous examples, you create a URL Builder module and pipe that into a Fetch Page module. The Fetch Page module is then used to cut out the relevant html and break it up into a list of items, one for each URL.
To start, view the webpage source and search for “Also visited” to identify the relevant section:

You also need to know where the URL list ends, so scroll down the Source until you get to the point where you can see the heading “Also searched for”:

For this one I’m not walking you through the step by step. Just use what you learned in the stages above and my post about scraping google hot trends with yahoo pipes.
Create another URL Builder module, and copy the full URL of the Google Trends search into the URL field. When you press Tab it should automatically populate the different fields. Then you connect the String Regex module from Stage 2 to the string input for the ‘q’ field.
Next create your Fetch Page module and attach the URL Builder pipe to this new module. Then, with a bit of trial and error, enter the relevant HTML code to cut out & separate the html for the links. Don’t worry about being too precise, because you will set up a ‘search & replace’ module to remove any unnecessary data.
When you’re done with this step, your Master pipe should look something like this:

This module will basically give you a list of items with html code for each URL. Because we really just need the root domain for each of the sites, you will need to clean up each item using a Regex module. The bit you want to extract is underlined in red.

The tool of choice for this is the Operators > Regex module, which uses regular expressions to search for and replace the sections that we don’t need in each item.
Again, using my favourite trial and error method, set up the regex rules so that at the end you have only the domain left as content. Rather than doing this all in Yahoo Pipes, I use the online RegEx tool with the relevant html copied into the content section to sort out my Regex rules.
Note that you will need to check each item to make sure your rules work for each item and extract the domain correctly.
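To give you a feel for the kind of rules you are after, here is a small Python sketch that applies a list of search & replace rules to each item, much like the Regex module does. The HTML fragment and the two rules are my own made-up example, so expect to adjust them to the markup you actually see in the page source.

```python
import re

# Each rule is (pattern, replacement), applied in order to every item.
# These two rules are illustrative assumptions, not the rules from the pipe.
RULES = [
    (r"<[^>]+>", ""),  # drop every HTML tag
    (r"\s+", ""),      # drop any leftover whitespace
]


def clean_item(item: str) -> str:
    """Apply the search & replace rules so only the domain text is left."""
    for pattern, replacement in RULES:
        item = re.sub(pattern, replacement, item)
    return item


# Made-up items standing in for the Fetch Page output.
items = [
    '<td><a href="/websites?q=example.org">example.org</a></td>',
    '<td> <a href="/websites?q=example.net">example.net</a> </td>',
]
print([clean_item(item) for item in items])
```

As in the pipe, run it over every item you get back, not just the first one, to confirm the rules hold up.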

Next, the final source of related URLs – Top backlinks!

Stage 5: Top Backlinks Through Open Site Explorer

One of my strategies for scaling profitable URLs is to look where the traffic to that URL is coming from. There are many different sources of traffic, and here you are after the websites that link to the URL you want to scale.
Open Site Explorer is a great tool to capture these backlinks:
● The index of backlinks is as good as Yahoo Site Explorer
● It offers API and CSV access to the data (makes scraping easier)
● It ranks sites by Page & Domain authority – ie most important backlinks first

This final part of the URL scraper again uses the URL Builder module, but instead of the Fetch Page module you will need to add the Fetch CSV module and use the CSV link (highlighted on the screenshot above):
http://www.opensiteexplorer.org/www.jonathanvolk.com/a!links!!f!all!!format!csv!!s!external!!t!page
Because this is not a standard formatted URL, you need to manually paste the relevant sections in the URL builder fields. You can create additional Path elements by clicking the + sign:
Base: http://www.opensiteexplorer.org
1st Path Element: the query URL, ie www.jonathanvolk.com
2nd Path Element: a!links!!f!all!!format!csv!!s!external!!t!page
Next you will need to make sure the URL you want to scale is passed to the new URL builder module. Before you connect it however, there is a small problem.
In this particular case, Open Site Explorer does not cope with URLs formatted like this: http://jonathanvolk.com. You therefore need to make sure that the input URL is reformatted and that the http:// is stripped from the URL.
Just use a String Replace module and replace the first occurrence of http:// with nothing.
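Put together, the String Replace and URL Builder steps amount to something like the Python sketch below. The base and the second path element are copied from the CSV link above; the function names are just mine for the example.

```python
OSE_BASE = "http://www.opensiteexplorer.org"
OSE_CSV_SUFFIX = "a!links!!f!all!!format!csv!!s!external!!t!page"


def strip_scheme(url: str) -> str:
    """Replace the first occurrence of http:// with nothing."""
    return url.replace("http://", "", 1)


def ose_csv_url(target: str) -> str:
    """Join base, query URL and CSV suffix as three path elements."""
    return "/".join([OSE_BASE, strip_scheme(target), OSE_CSV_SUFFIX])


print(ose_csv_url("http://www.jonathanvolk.com"))
```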

Next create the Fetch CSV module and connect it to the URL builder as shown below:

To configure the Fetch CSV module correctly, I recommend you download the actual CSV file and open it in Excel or similar spreadsheet program.
That way you can easily identify the number of rows that need to be skipped before the data starts, and name each of the columns as I have done above.
For this particular scraper you are only interested in the target URL, so like with the Google Serp scraper step above, you need to remove all other data:
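The same skip-rows, name-columns, keep-one-column logic looks like this in Python. Treat it as a sketch: the sample file, the number of rows to skip and the column names are all invented for the example, so check them against the CSV you downloaded.

```python
import csv
import io

# Made-up export: the preamble, header and column layout are assumptions.
SAMPLE_CSV = """Report for www.example.com
Generated by a hypothetical export

Title,URL,Anchor Text
Some page,http://example.org/links.html,example
Another page,http://example.net/resources/,click here
"""


def column_after_skipping(csv_text: str, skip_rows: int, columns: list[str], wanted: str) -> list[str]:
    """Skip the first `skip_rows` rows, then return the `wanted` column of the rest."""
    rows = list(csv.reader(io.StringIO(csv_text)))[skip_rows:]
    index = columns.index(wanted)
    return [row[index] for row in rows if len(row) > index]


# Skip 3 preamble rows plus the header row, then keep only the URL column.
print(column_after_skipping(SAMPLE_CSV, 4, ["title", "url", "anchor"], "url"))
```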

That’s the final one done.
Next you will combine all data sources together and clean up the URLs.

Stage 6: Tidy up your URL list

Before I walk you through how to finalise & use your URL scraper, a quick overview of what we have covered so far.
Based on an input URL your CPV Campaign Scaler creates 3 lists of related URLs:
1. Google SERPs – Related Sites
2. Google Trends – Sites Users Also Visited
3. Open Site Explorer Backlinks
These lists now need to be combined, cleaned up, and you want to remove duplicates.
To combine different item lists, just add an Operators > Union module and connect the outputs for each of the final modules on the 3 data sources to the Union module:

By clicking on the union module you should now have a single list of items showing in the Debugger window. The total of this list of items should equal the sum of the output from the 3 data sources.
You might have noticed that not all URLs are formatted in the same way. Before you can remove duplicates you therefore need to ensure all URLs are formatted in the right way. Because this is for scaling PPV campaigns, the format you will use is domain.com/path/page.ext.
So again, add an Operators > Regex module and configure it to remove unnecessary content from each item:
● Remove http:// from the start of any URLs
● Remove trailing / from the end of any URLs
See the screenshot below for the RegEx rules you need for this.
Finally, let’s remove duplicate URLs. Just connect an Operators > Unique module to the RegEx output and configure it to filter based on item.content:

In this particular example 221 URLs were reduced to 218, so there were only 3 duplicates.
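The whole tidy-up stage (Union, Regex, Unique) fits in a short Python sketch. It follows the two rules listed above; stripping https:// as well is my own addition, and the sample lists are made up.

```python
import re


def normalize(url: str) -> str:
    """Remove a leading http:// (and, as an assumption, https://) and any trailing /."""
    url = re.sub(r"^https?://", "", url.strip(), flags=re.IGNORECASE)
    return url.rstrip("/")


def combine(*sources: list[str]) -> list[str]:
    """Union all sources, normalize each URL, and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for source in sources:
        for url in source:
            clean = normalize(url)
            if clean and clean not in seen:
                seen.add(clean)
                result.append(clean)
    return result


similar = ["http://example.org/", "example.net"]
trends = ["example.org", "http://example.net/page.html"]
backlinks = ["http://example.com/a/"]
print(combine(similar, trends, backlinks))
```

Normalizing before de-duplicating matters: http://example.org/ and example.org only count as the same target once both are reduced to the same format.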
Save your pipe again. In the next and final stage you will get your Yahoo Pipe ready to publish the URLs and export them as a CSV file, so you can easily copy and paste the URLs into your traffic source of choice.

Stage 7: Launch your scraper

To be able to export your yahoo pipe as a CSV file, you will need to add & rename a couple of fields:
● Change the item.content field to item.title
● Add a y:id field
This is straightforward, but not very well documented on Yahoo Pipes.
Add an Operators > Rename module and connect the output from the Unique module above to the input of this Rename module. All you need is 2 rename rules:

Before you save and run your pipe, you need to connect the output from the Rename module to the Pipe Output module.
This module is always included on your pipe so you won’t need to add it.
However, because the pipe is rather complex and large, you might have trouble locating the Pipe Output. If this is the case, just click the “layout” button in the top left corner of the screen, and scroll to the top right hand corner of your pipe layout window.
You will need to drag and drop the Pipe Output module several times until you’re at the bottom of the Pipe Layout before you can connect the output from the Rename module to the input of the Pipe Output module.
Now hit the Save button at the top of the screen, and then click Run Pipe.
This will open a new tab/window and start to run the pipe. Once it has finished running, you should see something like this:

To get the CSV file, you need to copy the URL from the “Get as RSS” button. Just right-click the button and click “Copy link address” or “Open in New Window” in Google Chrome, or something similar in other browsers.
You should have a URL that looks something like this:

Then just change the _render setting to CSV and open the URL in your browser. This should pop up a “Save file as…” window. Before you save, change the filename to something like jonathanvolk.csv.
And that’s it.
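If you plan to do this often, the _render swap is easy to script. The Python sketch below rewrites only that one parameter and leaves the rest of the URL alone. The run URL in it is a placeholder (YOUR_PIPE_ID and the urlinput parameter are invented), and I am assuming the value is written in lowercase as csv.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def as_csv_url(run_url: str) -> str:
    """Return the same URL with its _render parameter set to csv."""
    parts = urlsplit(run_url)
    params = [
        (key, "csv" if key == "_render" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL: paste the link you copied from the "Get as RSS" button instead.
rss_url = "http://pipes.yahoo.com/pipes/pipe.run?_id=YOUR_PIPE_ID&_render=rss&urlinput=http%3A%2F%2Fjonathanvolk.com"
print(as_csv_url(rss_url))
```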
Click on the image below to open up the full yahoo pipe.


Now there are many ways in which you can build on this:
● Add data sources
● Scrape Alexa rank or Open Site Explorer Page Authority to sort your list and get the most important URLs on top
● Include URL variations (ie domain.com/ domaincom .domain.com)
If you liked this tutorial or have any questions, drop me a comment below. If you want more then check out my blog imrat.com & subscribe to my twitter feed.
