 PowerHome Messageboard : PowerHome Programming
Subject Topic: urlscraper Regex Failure
GadgetGuy
Super User

Joined: June 01 2008
Location: United States
Posts: 872
Posted: December 01 2019 at 15:52

Last week Weather Underground changed the standard web site I had been scraping for my weather information since they killed their API access a year ago, so I had to redesign the scrape.

I have readjusted my urlscraper PH plugin ini file to grab the new format, but after two days of struggle I still have not gotten it to work.

Clipping the relevant text out of a much larger file for simplicity, here we have...
Code:
class="dashboard__module__content"><lib-tile-current-conditions _ngcontent-sc56="" _nghost-sc59=""><div _ngcontent-sc59="" class="module__container"><div _ngcontent-sc59="" class="module__header"> Current Conditions </div><div _ngcontent-sc59="" class="module__body"><!----><!----><!----><div _ngcontent-sc59="" class="ng-star-inserted"><div _ngcontent-sc59="" class="small-4 columns text-left conditions-temp"><div _ngcontent-sc59="" class="main-temp" style="color:#fd843b;"><lib-display-unit _ngcontent-sc59="" _nghost-sc17=""><!----><span _ngcontent-sc17="" class="test-true wu-unit wu-unit-temperature is-degree-visible ng-star-inserted"><!----><!----><!----> <span _ngcontent-sc17="" class="wu-value wu-value-to" style="">77.3</span><span _ngcontent-sc17="" class="wu-label"><!---->


I'm trying to extract the current temp, which is the "77.3" figure in the string above.

My regular expression to grab it is...
Code:
<lib-tile-current-conditions[\s\S]*?class="wu-value wu-value-to" style="">(.+?)</span><span


but I NEVER get a capture fired off by the urlscraper.

If I run the search/capture expression through regex testers, they correctly capture the temperature.
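
For readers who want to repeat that regex-tester check offline, here is a minimal Python sketch. It assumes Python's `re` module behaves like the tester for this pattern; the sample string is an abbreviated copy of the clipped source, and `extract_temp` is a hypothetical helper name. It says nothing about how the urlscraper plugin itself evaluates the expression.

```python
import re

# The pattern from the post, unchanged.
PATTERN = re.compile(
    r'<lib-tile-current-conditions[\s\S]*?'
    r'class="wu-value wu-value-to" style="">(.+?)</span><span'
)

# Abbreviated copy of the clipped page source (most attributes trimmed).
SAMPLE = (
    '<lib-tile-current-conditions _ngcontent-sc56="" _nghost-sc59="">'
    '<div class="module__header"> Current Conditions </div>'
    '<span class="test-true wu-unit wu-unit-temperature is-degree-visible">'
    '<span _ngcontent-sc17="" class="wu-value wu-value-to" style="">77.3</span>'
    '<span _ngcontent-sc17="" class="wu-label">'
)


def extract_temp(html):
    """Return the first captured temperature string, or None if no match."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


print(extract_temp(SAMPLE))                # the clipped text matches
print(extract_temp("<html></html>"))       # text without the tile does not
```

If the expression matches the clipped text but the plugin still captures nothing, one thing worth checking (an assumption, not something established at this point in the thread) is whether the HTML the plugin actually receives contains the same markup as the text copied from the browser.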

I cannot figure out why urlscraper is NOT working. There are no other strings in the page source that match, so it is not a matter of an earlier match getting in the way.

Here is the urlscraper-WU.ini file....

[config]
urlcount=1

[URL_1]
url=https://www.wunderground.com/dashboard/pws/KFLTHEVI3?cm_ven=localwx_pwsdash
freq=10.0
scrapecount=1

[URL_1_1]
regexsearch=<lib-tile-current-conditions[\s\S]*?class="wu-value wu-value-to" style="">(.+?)</span><span
regexoccur=1
regexflags=0

Anyone have any thoughts to share? This is driving me crazy.

PS - Is there any way to test the PH plugins (e.g., force the execution of the urlscraper-WU.ini file, or trace its return results)?
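
Nothing in this thread describes a built-in test mode, so one workaround is to replay the ini file outside PowerHome. The sketch below is a stand-in written for illustration, not the plugin's code: `run_scrapes` and `SAVED_PAGE` are hypothetical names, `regexoccur` is ASSUMED to select the Nth match counting from 1, and `regexflags` and `freq` are read by nothing here because their meaning is not given in the thread.

```python
import configparser
import re


def run_scrapes(ini_text, fetch):
    """Apply each [URL_n_m] regex from urlscraper-style ini text.

    fetch is any callable mapping a URL to page text, so saved HTML can
    be replayed without touching the network.
    """
    # interpolation=None keeps any '%' in a URL or regex literal.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(ini_text)
    results = {}
    for n in range(1, cfg.getint("config", "urlcount") + 1):
        url_section = f"URL_{n}"
        page = fetch(cfg.get(url_section, "url"))
        for m in range(1, cfg.getint(url_section, "scrapecount") + 1):
            scrape = f"{url_section}_{m}"
            pattern = re.compile(cfg.get(scrape, "regexsearch"))
            # ASSUMPTION: regexoccur picks the Nth match, counting from 1.
            occur = cfg.getint(scrape, "regexoccur", fallback=1)
            matches = list(pattern.finditer(page))
            hit = matches[occur - 1] if 0 < occur <= len(matches) else None
            results[scrape] = hit.groups() if hit else None
    return results


INI_TEXT = r'''
[config]
urlcount=1

[URL_1]
url=https://www.wunderground.com/dashboard/pws/KFLTHEVI3?cm_ven=localwx_pwsdash
freq=10.0
scrapecount=1

[URL_1_1]
regexsearch=<lib-tile-current-conditions[\s\S]*?class="wu-value wu-value-to" style="">(.+?)</span><span
regexoccur=1
regexflags=0
'''

# Stand-in for a page saved to disk; trimmed from the clipped source.
SAVED_PAGE = (
    '<lib-tile-current-conditions _ngcontent-sc56="">'
    '<span class="wu-value wu-value-to" style="">77.3</span>'
    '<span class="wu-label">'
)

print(run_scrapes(INI_TEXT, lambda url: SAVED_PAGE))
```

Passing a `fetch` that returns the raw bytes a plain HTTP client downloads (decoded to text) instead of text copied from a browser would show whether the expression still matches what a non-browser client sees.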


Edited by GadgetGuy - December 01 2019 at 15:54


__________________
Ken B - Live every day like it's your last. Eventually, you'll get it right!
gg102
Senior Member

Joined: January 29 2013
Location: United States
Posts: 179
Posted: December 01 2019 at 16:27

I have tried this and been unsuccessful also.
I gave up.

I'll be following this closely!

GadgetGuy
Super User

Joined: June 01 2008
Location: United States
Posts: 872
Posted: December 02 2019 at 06:56

It appeared that the target web page may change dynamically from time to time and so fail to match the regex search string, so I switched to a more stable target page.

The new ini-file setup is...
Code:
[config]
urlcount=1

[URL_1]
url=https://www.wunderground.com/weather/us/fl/the-villages/KFLTHEVI3
freq=10.0
scrapecount=1

[URL_1_1]
regexsearch=homecity-button">home</i>[\s\S]*?class="wu-value wu-value-to" style="color:[\S]*?;">(.+?)</span>
regexoccur=1
regexflags=0


and a surrounding portion of the searched string is...
Code:
>81.95</strong> W </span><h1 _ngcontent-sc18=""><span _ngcontent-sc18="">Lady Lake, FL Weather Conditions</span><span _ngcontent-sc18="" class="icons"><i _ngcontent-sc18="" class="material-icons favorite-star">star_rate</i><i _ngcontent-sc18="" class="material-icons homecity-button">home</i></span></h1><div _ngcontent-sc18="" class="station-nav"><img _ngcontent-sc18="" alt="icon" class="station-condition" src="//www.wunderground.com/static/i/c/v4/33.svg"><a _ngcontent-sc18="" class="station-name"><lib-display-unit _ngcontent-sc18="" type="temperature" _nghost-sc13=""><!----><span _ngcontent-sc13="" class="test-true wu-unit wu-unit-temperature is-degree-visible ng-star-inserted"><!----><!----><!----> <span _ngcontent-sc13="" class="wu-value wu-value-to" style="color:#fd843b;">73</span> <span _ngcontent-sc13="" class="wu-label"><!----><span _ngcontent-sc13="" class="ng-star-inserted">F</span><!----></span><!----></span><!----></lib-display-unit> La Zamora Station</a><span _ngcontent-sc18="" id="report-link"


The target web site is now....
Code:
https://www.wunderground.com/weather/us/fl/the-villages/KFLTHEVI3


But I still get no Regex extraction.

Dave- It would REALLY be handy if the urlscraper plugin posted a trigger event and the capture (LOCAL1-LOCAL9) results in the Event Log !!!!!

dhoward
Admin Group

Joined: June 29 2001
Location: United States
Posts: 4264
Posted: December 02 2019 at 20:47

I've been using the mobile site of Weather Underground (https://www.wunderground.com/cgi-bin/findweather/getForecast?brand=mobile&query=32712), but my own
scraper for this has stopped working as well. Calling this URL now returns a 301 Moved Permanently. It would appear that WU is purposely trying to make
it difficult for people to scrape their site.
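
A redirect like this can be confirmed from outside PowerHome with a short Python sketch that sends a single request and does not follow the redirect. The helper names (`request_target`, `describe`, `probe`) and the User-Agent string are made up for illustration; only standard-library calls are used, and this is not how the plugin fetches pages.

```python
import http.client
from urllib.parse import urlsplit

REDIRECTS = {301, 302, 303, 307, 308}


def request_target(url):
    """Return the path-plus-query part of url, as sent on the request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def describe(status, location):
    """Summarise a response status for a human reading a log."""
    if status in REDIRECTS:
        return f"{status}: redirected to {location or '(no Location header)'}"
    if status == 200:
        return "200: page served directly"
    return f"{status}: unexpected status"


def probe(url, timeout=10):
    """Send one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        # The User-Agent value is a made-up placeholder.
        conn.request("GET", request_target(url),
                     headers={"User-Agent": "scrape-probe/0.1"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `describe(*probe(url))` with the mobile URL quoted above would show whether the 301 is still returned and where it points. No output is shown here because the live response is not known.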

One thing that you can do with the scraper plugin is to create a trigger for trigger number 256 with a trigger value of Any. This trigger is called
whenever the plugin encounters an error and may help with troubleshooting.

I played around with the URL above and this seems to be working well for me to capture the temperature:

Code:
wu-unit-temperature is-degree-visible[\S\s]*?wu-value-to[\S\s]*?>([0-9]*?)<


The page seems to change dynamically, but after multiple attempts the above seems to work. YMMV. I'm thinking it might be best to start finding an
alternative site.

Dave.
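
Editorial sketch, not part of Dave's post: the pattern above can be checked offline in Python against a trimmed copy of the markup quoted earlier in the thread. Because `([0-9]*?)` accepts only digits, a fractional reading such as the 77.3 from the first post is not captured as a number; in this sample the pattern falls through to a later `><` and returns an empty string. The `DECIMAL` variant is a hypothetical alternative, not something proposed in the thread, and `capture` is an illustrative helper name.

```python
import re

# Dave's pattern from the post, unchanged.
DAVE = re.compile(
    r'wu-unit-temperature is-degree-visible[\S\s]*?wu-value-to[\S\s]*?>([0-9]*?)<'
)
# Hypothetical variant: require digits and allow one decimal part.
DECIMAL = re.compile(
    r'wu-unit-temperature is-degree-visible[\S\s]*?wu-value-to[\S\s]*?'
    r'>([0-9]+(?:\.[0-9]+)?)<'
)

# Trimmed copy of the markup quoted earlier in the thread.
WHOLE = (
    '<span class="test-true wu-unit wu-unit-temperature is-degree-visible">'
    '<span class="wu-value wu-value-to" style="color:#fd843b;">73</span>'
    ' <span class="wu-label"><!---->'
)
# Same markup with a fractional reading substituted.
FRACTION = WHOLE.replace(">73<", ">77.3<")


def capture(pattern, html):
    """Return group 1 of the first match, or None if nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None


for name, pattern in (("DAVE", DAVE), ("DECIMAL", DECIMAL)):
    for label, html in (("whole", WHOLE), ("fraction", FRACTION)):
        print(name, label, repr(capture(pattern, html)))
```

Whether the plugin's regex engine treats lazy quantifiers and non-capturing groups the same way as Python's `re` is an assumption to verify before relying on the variant.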
TonyNo
Moderator Group

Joined: December 05 2001
Location: United States
Posts: 2867
Posted: December 02 2019 at 21:38

Damn weather.com!
GadgetGuy
Super User

Joined: June 01 2008
Location: United States
Posts: 872
Posted: December 03 2019 at 07:07

Thanks for jumping in here Dave. I don't think I
would have discovered the failure reason without an
awful lot more work. That sure explains the problem.

WU has been nibbling away at isolating their service
for some time now. Last year they killed the XML
access which gave nice clean data, and now the general
site. Bummer.

Fortunately I had another path which I fired up and
should have been doing anyway. I have a Temp Stick
(from https://tempstick.com) which very accurately
captures both temp and humidity and sends it into the
Cloud where it can be readily scraped. That is now
working even better (and more accurately) than the WU
data as the Stick is attached right outside my door.


All that aside, what is the chance of having the
Plugin Triggers post some info in the Event Log when
they fire? Sure would help debugging!!!


Edited by GadgetGuy - December 03 2019 at 07:07

