Plugin: URLScraper / URLScraper1
Launch Data (ActiveX Classname): PH_URLScraper.phurlscraper  or  PH_URLScraper1.phurlscraper1
Initialization Data: Enter the full path and filename of the INI file for the URLScraper plugin. If you're using the current PH release version defaults, this value will be: c:\powerhome\plugins\urlscraper.ini

The PowerHome URL Scraper plugins (URLScraper and URLScraper1) are designed to parse data from designated web pages, in a manner similar to the ph_regex() family of string search functions, except that the scraper plugins run in the background in their own independent thread and take no PowerHome resources.

The preferred scraper is "URLScraper", the most general implementation, which uses the Catalyst HTTP Control functions. In the rare instance where a unique website is incompatible with the Catalyst HTTP Control for some reason, the alternative URLScraper1, which uses raw socket communications, is available and will generally be successful.

The regex search uses the VBScript regular expression engine (the same engine used in the new ph_regex2 functions). Full details, and good tutorials, on the syntax can be found here ...

    http://www.regular-expressions.info/quickstart.html

Using the PowerHome URLScraper Plugins:

To use the scraper, the Plugin must be ...
  1) defined in PowerHome Explorer>Setup>Plugins,
  2) configured by modifying or replacing the default initialization file "c:\powerhome\plugins\urlscraper.ini", and
  3) tied to a trigger event created in the PowerHome Explorer Triggers table (to call a Macro, or other Action).


Defining the Plugin in PowerHome Explorer:

A definition for the URLScraper (or URLScraper1) plugin must be entered in the PowerHome Explorer>Setup>Plugins table as follows (the 'Load Order' number doesn't have to be as shown below; any sequence number is appropriate). See the discussion of "URLScraper" versus "URLScraper1" above.

The example image below shows both plugin versions defined for reference purposes, although normally only one would be used.

[Image: plugin definitions in PowerHome Explorer>Setup>Plugins]

Configuring the urlscraper.ini file:
  • Open the "c:\powerhome\plugins\urlscraper.ini" file in Notepad. You'll need to configure the URLScraper plugin actions by modifying that file.
  • The scraper .ini file consists of two main parts (see example below), one of which may repeat depending on the number and complexity of the "scrapes."

    The 1st section is the [config] area, which has a single entry that defines how many different URLs will be searched. The urlcount parameter can be a maximum of 10. If you need to scrape more than 10 URLs, you can declare another instance of the scraper in the Setup Plugins table (ie, URLSCRAPER2). See also the image above.

    The 2nd section comprises a primary [URL_x] section and one or more secondary sub-sections [URL_x_y].

    The primary part defines the URL address (eg, news.google.com), how often (in minutes) that URL is to be scanned, and the number of 'scrapes' to be applied to the URL.

    Since urlcount can be at most 10, there can be primary URL sections from [URL_1] to [URL_10].
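
    To make the naming hierarchy concrete, here is a hypothetical skeleton (the counts and frequencies are illustrative only):

    [config]
    urlcount=2          {two primary URL sections follow}

    [URL_1]             {1st URL: scanned every 30 seconds, two scrapes}
    url=...
    freq=0.5
    scrapecount=2

    [URL_1_1]           {1st scrape of URL_1}
    [URL_1_2]           {2nd scrape of URL_1}

    [URL_2]             {2nd URL: scanned hourly, one scrape}
    url=...
    freq=60
    scrapecount=1

    [URL_2_1]           {only scrape of URL_2}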

    There can be an unlimited number of scrapes (eg, [URL_1_113] would be possible). The number of matches (or snaps) is determined by the number of () parenthesis snap blocks within a scrape. Most of the examples below show a single snap per scrape, but you can have up to five (5) snaps per scrape. Within a single scrape, the individual snaps will be placed in the LOCAL variables (LOCAL1 through LOCAL5) in the same order as found.

    In addition to the LOCALs, all of the snaps within a scrape will also be placed, concatenated together, in the TEMP5 variable with a separator of <|> after each snap (ie, "wind<|>15<|>").

    Here is a weather data extraction example that pulls five (5) snaps off a page.


    [URL_1_1]
    regexsearch=<div id="main">[\s\S]*?<span>(.+)</span>[\s\S]*?<h4>(.+)</h4>[\s\S]*?<label>Wind:</label>[\s\S]*?<span>[\s\S]*?<span>(.+)</span>[\s\S]*?<label>Humidity:</label>[\s\S]*?<div class="b">(.+)</div>[\s\S]*?<div id="details1">[\s\S]*?<p>(.+)</p>
    regexoccur=1
    regexflags=0
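
    Assuming (hypothetically) that the page yielded "72", "Sunny", "NW 10 mph", "57%", and "Clear tonight" for the five snap groups, the trigger would receive:

    ;LOCAL1 = 72                      {1st (.+) group}
    ;LOCAL2 = Sunny                   {2nd (.+) group}
    ;LOCAL3 = NW 10 mph               {3rd (.+) group}
    ;LOCAL4 = 57%                     {4th (.+) group}
    ;LOCAL5 = Clear tonight           {5th (.+) group}
    ;TEMP5  = 72<|>Sunny<|>NW 10 mph<|>57%<|>Clear tonight<|>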

    Each sub-section (as in the example above) defines the regex search string, along with two parameters.

    The regexoccur defines which occurrence of the regex match (if there are multiple) is to be returned. If an HTML page contained <td>Temperature</td>random text<b>78</b> multiple times and you wanted to return the value contained in the 3rd occurrence, you would set the regexoccur parameter to 3. Take for example this snippet:

    <td>Temperature</td><br>Apopka: <b>78</b><br>
    <td>Temperature</td><br>Orlando: <b>79</b><br>

    A regexoccur of 1 will return the temperature for Apopka. A regexoccur of 2 will return the temperature for Orlando. The regex search string of...

    <td>Temperature</td>[\s\S]*?<b>(.+)</b>

    will match both entries, but if you want the second one, you need to change the regexoccur parameter (or change the regex search itself to specifically reference Orlando).
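
    For example, either of these hypothetical sub-sections would snap Orlando's temperature:

    [URL_1_1]
    regexsearch=<td>Temperature</td>[\s\S]*?<b>(.+)</b>
    regexoccur=2        {take the 2nd occurrence of the match}
    regexflags=0

    [URL_1_1]
    regexsearch=Orlando: <b>(.+)</b>    {anchor the search on the city name instead}
    regexoccur=1
    regexflags=0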

    In the five-snap weather data INI code earlier, the regexsearch looks for <div id="main"> and then skips over everything with the [\s\S]*?, which matches any text that appears up to the next <span> tag.

    In the [\s\S] character class, the \s matches any whitespace character (ie, space, tab, line break, form feed) and the \S matches any non-whitespace character; together they match every character, including line breaks.

    The *? makes this a non-greedy (ie, lazy) search that stops at the first match found.

    The (.+) denotes that everything between the <span> and </span> tags be "snapped" out of the data string (the temperature in this case) for this regex search. The .+ is greedy and will match through to the last occurrence on a line. We are safe using it here because there are no data repeats before the next Return code.
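
    To illustrate the greedy/lazy difference, given the hypothetical one-line input <span>A</span><span>B</span>:

    <span>(.+)</span>     {greedy: snaps A</span><span>B}
    <span>(.+?)</span>    {lazy: snaps A}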

    The default regexflags parameter of 0 can have a 1, or a 2 (or both) added to it, where the "1" determines case sensitivity and the "2" determines the plugin trigger firing condition, as follows ...
  • Adding 1 makes the search case sensitive, otherwise searches are case insensitive.
  • Adding 2 causes the plugin to ONLY fire the generic plugin trigger if the snapped value has changed (the plugin stores the last "snap" of each scrape, and if the new "snap" is different from the last snap, the trigger will be fired). If 2 is not added to the flags parameter, the trigger will fire every time, even if the value is the same as the last scrape.
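
    The resulting values are:

    regexflags=0    {case insensitive; trigger fires on every scrape}
    regexflags=1    {case sensitive; trigger fires on every scrape}
    regexflags=2    {case insensitive; trigger fires only when the snap changes}
    regexflags=3    {case sensitive, and the trigger fires only when the snap changes}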



    Example urlscraper.ini File
    [config]
    urlcount=1

    The urlcount determines how many unique URLs will be retrieved (NOTE: a single instance of the plugin can retrieve multiple different URLs). For each URL in the URL count, you'll have a URL_x section. That is, for a URL count of 1, you'll have a [URL_1] section. If you have a count of 2, you'll have both a [URL_1] and a [URL_2] section.

    [URL_1]
    url=http://m.wund.com/cgi-bin/findweather/getForecast?brand=mobile&query=32712
    freq=0.5
    scrapecount=2

    Within a [URL_X] section, you’ll have the url, the freq (the frequency in minutes for how often to retrieve the URL), and the scrapecount. The scrapecount is how many regex searches are going to be made against the retrieved URL HTML data. For [URL_1] with a scrapecount of 2, you’ll have both a [URL_1_1] and [URL_1_2] section. If you have a [URL_2] section with a scrapecount of 1, then you’ll have only a [URL_2_1] section.

    [URL_1_1]
    ;Temperature {The ";" makes this a comment line}
    regexsearch=<td>Temperature</td>[\s\S]*?<b>(.+)</b>

    The [URL_X_Y] section defines a regex search for the URL and fires a generic plugin trigger. The “X” value corresponds to the Trigger ID column (Command 1 for an X value of 1) and the “Y” value corresponds to the Trigger Value column (Option 1 for a Y value of 1).
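
    For example, with hypothetical section names:

    [URL_1_1] fires with Trigger ID Number "Command 1" and Trigger Value "Option 1"
    [URL_1_2] fires with Trigger ID Number "Command 1" and Trigger Value "Option 2"
    [URL_2_3] fires with Trigger ID Number "Command 2" and Trigger Value "Option 3"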

    The search looks for the unique string "<td>Temperature</td>" and then matches everything "[\s\S]" that is either a 'whitespace character or not a whitespace character' (ie, everything) occurring zero or more times "*", with the "?" making the search non-greedy (that is, stop at the first match found, rather than looking for more).

    regexoccur=1 {Grab the first match found (actually redundant with the ? in the search string)}
    regexflags=0  {Use 0 for case insensitive searches. Add 1 for case sensitive searches, and/or add 2 to fire the trigger only on "new" snaps.}

    [URL_1_2]
    ;Humidity
    regexsearch=<td>Humidity</td>[\s\S]*?<b>(.+)</b>
    regexoccur=1
    regexflags=0

    These two searches will find the Temperature and the Humidity from the m.wund.com web site, each firing its own trigger with the snapped value in LOCAL1 (see Configuring the Triggers below). Up to five (5) snaps can be captured per scrape. If more are needed, a second (ie, [URL_2]) section could be defined to capture additional data.

    NOTE: The [URL_x] sections are independent of each other and thus can be used to capture additional data from the same site, as indicated above, or may be directed at totally different sites with different frequencies and scrape counts for different purposes (ie, weather on one and football scores on another), as sketched below.
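
    A hypothetical two-site configuration (the sports URL and its search string are placeholders, not a real feed):

    [config]
    urlcount=2

    [URL_1]
    url=http://m.wund.com/cgi-bin/findweather/getForecast?brand=mobile&query=32712
    freq=0.5
    scrapecount=2
    ;[URL_1_1] and [URL_1_2] as defined above

    [URL_2]
    url=http://www.example.com/scores
    freq=60
    scrapecount=1

    [URL_2_1]
    regexsearch=Final: (.+)
    regexoccur=1
    regexflags=2    {fire only when the score changes}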

Configuring the Triggers:

    Create a new generic plugin trigger for each of the URL_x scrapes. You can give them IDs of WEATHER1 and WEATHER2. Set the Trigger ID Number to "Command 1" for both, and the Trigger Value to "Option 1" for WEATHER1 and "Option 2" for WEATHER2.

    When the WEATHER1 trigger fires, the Temperature will be in LOCAL1. When WEATHER2 fires the Humidity will be in LOCAL1.

    If you wanted only a single trigger carrying both temperature and humidity, you would need a single scrape (URL_1_1) with two snap sections, as sketched below. Then you would have temperature in LOCAL1 (since it occurs first) and humidity in LOCAL2.
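
    A hypothetical single-scrape version of the m.wund.com example (note the scrapecount drops to 1 and the two searches combine into one regexsearch with two snap groups):

    [URL_1]
    url=http://m.wund.com/cgi-bin/findweather/getForecast?brand=mobile&query=32712
    freq=0.5
    scrapecount=1

    [URL_1_1]
    ;Temperature and Humidity in a single scrape
    regexsearch=<td>Temperature</td>[\s\S]*?<b>(.+)</b>[\s\S]*?<td>Humidity</td>[\s\S]*?<b>(.+)</b>
    regexoccur=1
    regexflags=0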

    As referenced above in the URL_1_1 description, up to 5 search matches can be captured in a single trigger. If more are needed, then a 2nd URL_x (ie, URL_2) must be defined. It can pass five more data elements.

    All Triggers for URL_1 extractions are triggered with a "Trigger ID Number" of "Command 1", where the "Trigger Value" will be "Option 1" for the 1st scrape ([URL_1_1]), "Option 2" for the 2nd scrape ([URL_1_2]), etc.

    URL_2 extractions will be triggered from "Command 2" triggers, with "Option 1" for the 1st scrape in this 2nd URL group, "Option 2" for the next, etc.

    [Image: URL Scrape Trigger definition]



    Weather Extraction Example Using the Weather Underground API

    The following example demonstrates an extraordinarily efficient way to extract weather data. [Special thanks to "snoker", who discovered this!]

    Weather Underground has a free "developer API" you can sign up for that will VERY quickly return a customized XML- or JSON-formatted document with only the raw information needed (including a full complement of weather data combined with precipitation over time).

    You have to sign up for a free API key (be sure to pick the free option, which is the developer 500-calls-per-day / 10-calls-per-minute option; any larger volumes they sensibly want you to pay for).

    You simply register a free account with them. They send a confirmation e-mail for the account creation, and once you're in, go through their "cart" (as mentioned previously, at no charge to you if you pick the correct option) and check out.

    You are then issued your API key (keep this tight for obvious reasons).

    The documentation (http://www.wunderground.com/weather/api/d/docs) is fairly straightforward. You can call by LAT/LON, city location, or look up a specific station by its ID (if you know of a maintained station nearby at a school, airport, etc.). Here is an example call to a local site, using the Weather Underground demo API key, that goes after a specific weather station:

    ph_geturl1( "http://api.wunderground.com/api/6f4c56e1c72e7c82/geolookup/ conditions/q/pws:KCOHIGHL5.json",1,20)

    ...and here is an example of how to call out to a city location instead:

    ph_geturl1( "http://api.wunderground.com/api/6f4c56e1c72e7c82/conditions /q/CA/San_Francisco.json",1,20)

    NOTE: You have to put your own API key in where theirs (6f4c56e1c72e7c82) is used here (I'm sure they cycle them, and probably disable these on a frequent basis, so these may not work without inserting your own key). Also, if you want to extract the returned data to examine its actual contents, you can use one of the two example methods above, but the output will be so voluminous that you will have to use a ph_writefile() function to get the full results. For example...

    [Image: URL test code]
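
    The screenshot isn't reproduced here, but a minimal sketch of such a test formula might look like the following (the file name is hypothetical, and the exact argument list of ph_writefile should be verified against the PowerHome function reference):

    ph_writefile("c:\powerhome\wundergnd.txt",ph_geturl1("http://api.wunderground.com/api/6f4c56e1c72e7c82/conditions/q/CA/San_Francisco.json",1,20))  {Hypothetical sketch: argument order assumed}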

    But you have to open the result in something like Notepad++ to see it...and it's AWESOME (excerpt below):

      . . . .
      . . . .
    "weather":"Clear",
    "temperature_string":"69.5 F (20.8 C)",
    "temp_f":69.5,
    "temp_c":20.8,
    "relative_humidity":"57%",
    "wind_string":"From the West at 1.2 MPH Gusting to 3.0 MPH",
    "wind_dir":"West",
    "wind_degrees":270,
    "wind_mph":1.2,
    "wind_gust_mph":"3.0",
    "wind_kph":1.9,
    "wind_gust_kph":"4.8",
    "pressure_mb":"1015",
      . . . .
      . . . .

    The following URLScraper ini structure will access the weather site via its API and scrape off the Temperature and wind speed data elements, passing each via its own Trigger (Command 1/Option 1 and Command 1/Option 2 respectively) with the snapped value in LOCAL1. If the Trigger Action Type field is set to a Macro, that macro can then use the data passed in LOCAL1...LOCALx as appropriate.

    NOTE: both data elements could also be captured in a single [URL_1_1] scrape, as illustrated in the "five (5) snaps off a page" example under the "Configuring the urlscraper.ini file" heading above.

    NOTE: Setting the regexflags=2 would only fire the Trigger when the weather data actually changed from one reading to the next. If there are no changes, then there will be no attempt to fire the Trigger, thus minimizing extra CPU loading.


    [config]
    urlcount=1

    [URL_1]
    url=http://api.wunderground.com/api/7d6ed69e542ccec1/conditions/q/MA/Boston.json
    freq=15.0
    scrapecount=2

    [URL_1_1]
    regexsearch="temp_f":(.+),
    regexoccur=1
    regexflags=0

    [URL_1_2]
    regexsearch="wind_mph":(.+),
    regexoccur=1
    regexflags=0
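
    To adopt the regexflags=2 behavior noted above (fire only on changed readings), the last line of each of the two sub-sections would simply become:

    regexflags=2    {fire the trigger only when the snapped value changes}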