Apr 12, 2008
iMacros as an ETL Tool, or Screen Scraping on Steroids
Posted by Don in iMacro
I'm not talking about evil, low-life scraping that involves stealing content.
Data Wants To Be Free
In this case we're looking at how to get data out of an existing system that has nothing more than a web interface. Perhaps for political reasons your IT department won't give you SQL access to the database, but you really need to pull out a copy of the product list and get it into a local MySQL database. Or perhaps you're just trying to keep track of your competition's pricing. Maybe you want to prepare a demo for a client before they love you and give you their product list.
There's absolutely no reason that data has to stay locked away, keeping you from getting your job done.
ETL Can Do The Task
Or maybe not. There are a ton of commercial and free Extract, Transform, and Load (ETL) tools out there. They all have one thing in common: they were built to handle large movements of data between a myriad of data sources. Very few of them were built with web scraping in mind.
iMacros Will Save You
Most people with an understanding of iMacros realize that it does a very good job of the Extract and Load processes. But out of the box it's not very good at Transformation. The iMacros language is declarative, much like HTML: there are no constructs for logic such as if-then-else, or for applying functions to data. In fact, if you want to do anything other than some pretty generic loop processing, you usually have to resort to a scripting language.
There's a big problem with relying on client-side scripting to handle your processing: sometimes you don't have access to the client. Take our Backlink Pinger, for example. This is an application that requires no software on the desktop other than the iMacros runtime. The server generates an iMacros script for the client to run, and the client browser detects that it's a script because the Content-Type is set to application/octet-stream and the extension is .iim. So the user presses a button on a web page and, presto, they're running an iMacros script. There are lots of situations where you simply can't install a client-side application on every desktop that needs to run it. Imagine the hassle of maintaining a 1,000-desktop installation of a little Visual Basic application that pulls data from the corporate mainframe into a spreadsheet.
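To make the server side of that idea concrete, here is a minimal sketch in JavaScript of a handler that sends a generated macro with the Content-Type and .iim extension described above. It is not the Backlink Pinger's actual code: the function names, the /ping.iim path, and the three-line macro body are all assumptions made for illustration.

```javascript
// Hypothetical helper: build the text of a small iMacros script for one site.
function buildMacro(siteUrl) {
  return [
    'TAB T=1',
    'TAB CLOSEALLOTHERS',
    'URL GOTO=' + siteUrl,
  ].join('\n') + '\n';
}

// Hypothetical request handler with the (req, res) shape used by Node's
// http.createServer. It only answers paths ending in .iim (an assumption),
// and labels the response as application/octet-stream.
function serveMacro(req, res) {
  const path = new URL(req.url, 'http://localhost').pathname;
  if (!path.endsWith('.iim')) {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not found');
    return;
  }
  const body = buildMacro('http://promote-my-site.com/');
  res.writeHead(200, {
    'Content-Type': 'application/octet-stream',
    'Content-Length': Buffer.byteLength(body),
  });
  res.end(body);
}
```

Wiring it into a server would be a one-liner such as require('http').createServer(serveMacro).listen(8080), where the port is arbitrary.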
iMacros Transformations
So how can you perform transformations in iMacros without using the scripting host? The trick is to use a web page that you control to perform your transformations. You can extract data from one web site using iMacros, submit it to your own site, perform the transformation there, then return the data in a format that is easy for your macro to digest.
Let's take a straightforward example. Our task is to write a script that can check Technorati for our current authority ranking and insert only the number into an application that tracks our authority. First, the iMacros script to grab the authority:
VERSION BUILD=6120228
TAB T=1
TAB CLOSEALLOTHERS
URL GOTO=http://technorati.com/blogs/promote-my-site.com
blogreactions&&HREF:http://technorati.com/search/promote-my-site.com EXTRACT=TXT
SET !VAR1 {{!EXTRACT}}
iMacros and PHP
Running that script places "Authority: NNN" in the !VAR1 variable. But our application needs to strip out the "Authority:" from the string. If you've got access to a PHP server, then you could write something such as:
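<?php // Strip the "Authority: " label, escape the rest, and wrap it in a <data> tag for the macro to extract.
echo "<data>" . htmlspecialchars(str_replace("Authority: ", "", $_GET['authority_string'])) . "</data>";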
Assuming we save that snippet in a file named "authtrans.php" on our server, we can add the following to our iMacros script:
' Clear the earlier extraction so !EXTRACT holds only the transformed value
SET !EXTRACT NULL
URL GOTO=http://promote-my-site.com/authtrans.php?authority_string={{!VAR1}}
TAG POS=1 TYPE=DATA EXTRACT=TXT
SET !VAR2 {{!EXTRACT}}
The value of !VAR2 will now be equal to the authority number, without the troublesome prepended string.
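What the server-side step has to do (decode the query string, strip the label, and hand back something safe to embed in a page) can be sketched in JavaScript as well. This is only an illustration of the technique under assumptions: renderTransformPage, stripAuthority, and escapeHtml are made-up names, and the sketch assumes the value arrives in an authority_string query parameter as in the example above.

```javascript
// Remove the "Authority: " label, leaving only the number as a string.
function stripAuthority(text) {
  return text.replace('Authority: ', '');
}

// Escape characters that are special in HTML so the echoed value
// cannot inject markup into the page.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical stand-in for the transformation page: given the request URL,
// read authority_string from the query and return the body the macro extracts.
function renderTransformPage(requestUrl) {
  const params = new URL(requestUrl, 'http://localhost').searchParams;
  const raw = params.get('authority_string') || '';
  return '<data>' + escapeHtml(stripAuthority(raw)) + '</data>';
}
```

Note that the space in "Authority: 123" reaches the server percent-encoded (or as a plus sign), so the decoding done by searchParams matters before the label is stripped.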
iMacros and JavaScript
That was a simple example, but the technique can be used to perform wonders. There's really no limit to what you can do inside the transformation step. Don't have access to a PHP server? Then just create an HTML page and use JavaScript to perform the transform. For example:
<input id="aus" name="authority_string" value=""
onchange="this.value = this.value.replace(/Authority:\s*/, '');">
iMacros and JavaScript - A Test
Load a page containing that field, type "Authority: 123" into it, tab out, and the "Authority:" label disappears from the value.
Just use iMacros to fill in that field with what you want transformed, then extract the contents of the field. The JavaScript that performs the transform will fire, and you'll get the modified version back in your script.
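If the transform grows beyond a one-liner, it may be easier to keep it in a named function and attach it to the field from a script. A minimal sketch, assuming the field has the id "aus" as in the example above; stripAuthorityLabel and attachTransform are hypothetical names, not part of iMacros or the browser.

```javascript
// Strip an "Authority:" label and any surrounding whitespace from a field value.
function stripAuthorityLabel(value) {
  return value.replace(/^\s*Authority:\s*/, '').trim();
}

// Hypothetical helper: attach the transform to an input element's onchange,
// so that filling the field and leaving it rewrites the value in place.
function attachTransform(input) {
  input.onchange = function () {
    input.value = stripAuthorityLabel(input.value);
  };
}
```

In the page itself this would be wired up with attachTransform(document.getElementById('aus')); after the page has loaded.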
iMacros is incredibly powerful. If you're not using it, you're wasting your time.




