PowerHome Messageboard : PowerHome Programming
Subject Topic: urlscraper Regex Failure
GadgetGuy
Super User
Joined: June 01 2008
Location: United States
Posts: 874
Posted: December 01 2019 at 15:52

Last week Weather Underground changed the standard web page I had been scraping for my weather information (they killed their API access a year ago), so I had to redesign the scrape.

I have readjusted my URLscraper PH plugin ini to grab the new format, but after two days of struggle I still have not gotten it to work.

Clipping the relevant text out of a much larger file for simplicity, here we have...
Code:
class="dashboard__module__content"><lib-tile-current-conditions _ngcontent-sc56="" _nghost-sc59=""><div _ngcontent-sc59="" class="module__container"><div _ngcontent-sc59="" class="module__header"> Current Conditions </div><div _ngcontent-sc59="" class="module__body"><!----><!----><!----><div _ngcontent-sc59="" class="ng-star-inserted"><div _ngcontent-sc59="" class="small-4 columns text-left conditions-temp"><div _ngcontent-sc59="" class="main-temp" style="color:#fd843b;"><lib-display-unit _ngcontent-sc59="" _nghost-sc17=""><!----><span _ngcontent-sc17="" class="test-true wu-unit wu-unit-temperature is-degree-visible ng-star-inserted"><!----><!----><!----> <span _ngcontent-sc17="" class="wu-value wu-value-to" style="">77.3</span><span _ngcontent-sc17="" class="wu-label"><!---->


I'm trying to extract the current temp, which is the "77.3" figure in the string above.

My regular expression to grab it is...
Code:
<lib-tile-current-conditions[\s\S]*?class="wu-value wu-value-to" style="">(.+?)</span><span


but I NEVER get a capture fired off by the URLscraper.

If I run the search/capture expression thru Regex testers, they correctly capture the temperature.

I cannot figure out why URLScraper is NOT working. There are no other Match strings in the web page source code that match so it is not a matter of missing an earlier string.
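
For anyone who wants to repeat the regex-tester check offline, here is a minimal Python sketch. It runs the expression from the post against a shortened copy of the fragment quoted above. Note the assumptions: Python's `re` engine stands in for whatever engine the plugin uses, and the sample string is the pasted source, not what the plugin actually downloads.

```python
import re

# Shortened copy of the page fragment quoted above (attributes trimmed for brevity).
html = (
    '<lib-tile-current-conditions _ngcontent-sc56="" _nghost-sc59="">'
    '<div class="module__header"> Current Conditions </div>'
    '<span _ngcontent-sc17="" class="wu-value wu-value-to" style="">77.3</span>'
    '<span _ngcontent-sc17="" class="wu-label">'
)

# The regexsearch expression from the post, unchanged.
pattern = re.compile(
    r'<lib-tile-current-conditions[\s\S]*?'
    r'class="wu-value wu-value-to" style="">(.+?)</span><span'
)

match = pattern.search(html)
temperature = match.group(1) if match else None
print(temperature)  # 77.3
```

A match here only shows that the expression is sound against this particular text. It says nothing about whether the text the plugin fetches looks the same.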

Here is the urlscraper-WU.ini file....

[config]
urlcount=1

[URL_1]
url=https://www.wunderground.com/dashboard/pws/KFLTHEVI3?cm_ven=localwx_pwsdash
freq=10.0
scrapecount=1

[URL_1_1]
regexsearch=<lib-tile-current-conditions[\s\S]*?class="wu-value wu-value-to" style="">(.+?)</span><span
regexoccur=1
regexflags=0
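
A quick sanity check of an ini like this one can be sketched in Python. This is only an illustration, not a PowerHome tool: the `URL_n` / `URL_n_m` section naming is inferred from the file above, `check_ini` is a hypothetical helper, and Python's regex engine may not behave exactly like the plugin's.

```python
import configparser
import re

INI_TEXT = r'''
[config]
urlcount=1

[URL_1]
url=https://www.wunderground.com/dashboard/pws/KFLTHEVI3?cm_ven=localwx_pwsdash
freq=10.0
scrapecount=1

[URL_1_1]
regexsearch=<lib-tile-current-conditions[\s\S]*?class="wu-value wu-value-to" style="">(.+?)</span><span
regexoccur=1
regexflags=0
'''

# interpolation=None keeps any '%' in a URL or regex from being treated specially.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(INI_TEXT)


def check_ini(cfg):
    """Return (section, problem) tuples; an empty list means nothing obvious is wrong."""
    found = []
    for u in range(1, cfg.getint("config", "urlcount") + 1):
        url_section = f"URL_{u}"
        if " " in cfg.get(url_section, "url"):
            found.append((url_section, "url contains a space"))
        for s in range(1, cfg.getint(url_section, "scrapecount") + 1):
            scrape_section = f"URL_{u}_{s}"
            expr = cfg.get(scrape_section, "regexsearch")
            try:
                if re.compile(expr).groups < 1:
                    found.append((scrape_section, "no capture group"))
            except re.error as exc:
                found.append((scrape_section, f"bad regex: {exc}"))
    return found


problems = check_ini(parser)
print(problems)  # []
```

An empty list only rules out structural mistakes (missing sections, a stray space in the URL, an expression that does not compile or has no capture group).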

Anyone have any thoughts to share? This is driving me crazy.

PS - Is there any way to test the PH plugins (e.g., force the execution of the urlscraper-WU.ini file, or trace its return results)?


Edited by GadgetGuy - December 01 2019 at 15:54


__________________
Ken B - Live every day like it's your last. Eventually, you'll get it right!
 
gg102
Senior Member
Joined: January 29 2013
Location: United States
Posts: 206
Posted: December 01 2019 at 16:27

I have tried this and been unsuccessful also.
I gave up.

I'll be following this closely!

 
GadgetGuy
Super User
Joined: June 01 2008
Location: United States
Posts: 874
Posted: December 02 2019 at 06:56

It appeared the target web page may have dynamically changed from time to time, thus not matching the regex search string, so I changed to a more stable target page.

The new ini-file setup is...
Code:
[config]
urlcount=1

[URL_1]
url=https://www.wunderground.com/weather/us/fl/the-villages/KFLTHEVI3
freq=10.0
scrapecount=1

[URL_1_1]
regexsearch=homecity-button">home</i>[\s\S]*?class="wu-value wu-value-to" style="color:[\S]*?;">(.+?)</span>
regexoccur=1
regexflags=0


and a surrounding portion of the searched string is...
Code:
>81.95</strong> W </span><h1 _ngcontent-sc18=""><span _ngcontent-sc18="">Lady Lake, FL Weather Conditions</span><span _ngcontent-sc18="" class="icons"><i _ngcontent-sc18="" class="material-icons favorite-star">star_rate</i><i _ngcontent-sc18="" class="material-icons homecity-button">home</i></span></h1><div _ngcontent-sc18="" class="station-nav"><img _ngcontent-sc18="" alt="icon" class="station-condition" src="//www.wunderground.com/static/i/c/v4/33.svg"><a _ngcontent-sc18="" class="station-name"><lib-display-unit _ngcontent-sc18="" type="temperature" _nghost-sc13=""><!----><span _ngcontent-sc13="" class="test-true wu-unit wu-unit-temperature is-degree-visible ng-star-inserted"><!----><!----><!----> <span _ngcontent-sc13="" class="wu-value wu-value-to" style="color:#fd843b;">73</span> <span _ngcontent-sc13="" class="wu-label"><!----><span _ngcontent-sc13="" class="ng-star-inserted">F</span><!----></span><!----></span><!----></lib-display-unit> La Zamora Station</a><span _ngcontent-sc18="" id="report-link"


The target web site is now....
Code:
https://www.wunderground.com/weather/us/fl/the-villages/KFLTHEVI3


But I still get no Regex extraction.

Dave- It would REALLY be handy if the urlscraper plugin posted a trigger event and the capture (LOCAL1-LOCAL9) results in the Event Log !!!!!

 
dhoward
Admin Group
Joined: June 29 2001
Location: United States
Posts: 4335
Posted: December 02 2019 at 20:47

I've been using the mobile site of Weather Underground (https://www.wunderground.com/cgi-bin/findweather/getForecast?brand=mobile&query=32712), but my own
scraper for this has stopped working as well. Calling this URL now returns a 301 Moved Permanently. It would appear that WU is purposefully trying to make
it difficult for people to scrape their site.

One thing that you can do with the scraper plugin is to create a trigger for trigger number 256 with a trigger value of Any. This trigger is called
whenever the plugin encounters an error and may help with troubleshooting.

I played around with the URL above and this seems to be working well for me to capture the temperature:

Code:
wu-unit-temperature is-degree-visible[\S\s]*?wu-value-to[\S\s]*?>([0-9]*?)<


The page seems to dynamically change, but after multiple attempts the above seems to work. YMMV. I'm thinking it might be best to start finding an
alternative site.
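
The expression above can be tried outside PowerHome with a short Python sketch. Assumptions: Python's `re` engine is used in place of the plugin's, the sample markup is a trimmed copy of the fragment posted earlier in the thread, and `variant` is a hypothetical tweak, not something from the plugin documentation. One thing the sketch suggests is that `([0-9]*?)` captures whole-degree readings but returns an empty capture when the page shows a decimal such as 77.3.

```python
import re

# The expression as posted.
posted = re.compile(
    r'wu-unit-temperature is-degree-visible[\S\s]*?wu-value-to[\S\s]*?>([0-9]*?)<'
)
# Hypothetical variant that also accepts a sign and a decimal part.
variant = re.compile(
    r'wu-unit-temperature is-degree-visible[\S\s]*?wu-value-to[\S\s]*?>'
    r'(-?[0-9]+(?:\.[0-9]+)?)<'
)

whole = (
    '<span class="test-true wu-unit wu-unit-temperature is-degree-visible ng-star-inserted">'
    '<!----> <span class="wu-value wu-value-to" style="color:#fd843b;">73</span> '
    '<span class="wu-label">F</span></span>'
)
decimal = whole.replace(">73<", ">77.3<")


def first_capture(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


print(repr(first_capture(posted, whole)))     # '73'
print(repr(first_capture(posted, decimal)))   # '' (matches, but captures nothing)
print(repr(first_capture(variant, whole)))    # '73'
print(repr(first_capture(variant, decimal)))  # '77.3'
```

The empty capture happens because the lazy `[\S\s]*?>` keeps advancing until it finds a `>` followed directly by `<`, which satisfies `([0-9]*?)<` with zero digits.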

Dave.
 
TonyNo
Moderator Group
Joined: December 05 2001
Location: United States
Posts: 2871
Posted: December 02 2019 at 21:38

Damn weather.com!
 
GadgetGuy
Super User
Joined: June 01 2008
Location: United States
Posts: 874
Posted: December 03 2019 at 07:07

Thanks for jumping in here Dave. I don't think I
would have discovered the failure reason without an
awful lot more work. That sure explains the problem.

WU has been nibbling away at isolating their service
for some time now. Last year they killed the XML
access which gave nice clean data, and now the general
site. Bummer.

Fortunately I had another path which I fired up and
should have been doing anyway. I have a Temp Stick
(from https://tempstick.com) which very accurately
captures both temp and humidity and sends it into the
Cloud where it can be readily scraped. That is now
working even better (and more accurately) than the WU
data as the Stick is attached right outside my door.


All that aside, what is the chance of having the
Plugin Triggers post some info in the Event Log when
they fire? Sure would help debugging!!!


Edited by GadgetGuy - December 03 2019 at 07:07


 
Handman
Senior Member
Joined: February 02 2009
Location: United States
Posts: 205
Posted: September 20 2020 at 14:12

I installed PH on a new PC running W10 last month and just realized that this url scraper to harvest weather conditions from my WeatherUnderground API is toast. I tried inserting the public URL, per Ken's suggestion in another thread, but now I get another error:

An error occurred while processing the execution queue. Resetting execution queue and trying again.
*** Error Details ***
Error Number: 39
Object Name: uo_socketblob
Class: uo_socketblob
Routine Name: f_geturl
Line: 51
Text: Error accessing external object property remoteport at line 51 in function f_geturl of object uo_socketblob.

Per one of Dave's suggestions, I ran regsvr32 "c:\windows\syswow64\cswskax6.ocx" as administrator, with no luck. Then I successfully ran regsvr32 "c:\windows\syswow64\cswskax8.ocx" on the later version of that file, but I am still getting the same error from the url scraper function.

Any ideas what is going on or how I can start updating meteorological data into PH again?
 
dhoward
Admin Group
Joined: June 29 2001
Location: United States
Posts: 4335
Posted: September 20 2020 at 19:57

Handman,

What version of PowerHome are you running? Also, if you can post the actual formula that you're using, I can get a better idea of what is going on. I'm
guessing you're running either the ph_geturl or ph_geturl1 function. If you're running the ph_geturl1 function, then the type number that you're using
controls which mechanism the function will use to access the URL, which might not even be the cswskax8.ocx file.

Let me know and we'll at least get the URL function working. Might have to work up something for weather scraping since weatherunderground made things so
difficult.

Dave.
 
Handman
Senior Member
Joined: February 02 2009
Location: United States
Posts: 205
Posted: September 20 2020 at 22:49

I'm running 2.1.5c since it's worked for years. The formula is part of a Get_Weather macro that I think Beach Bum posted in the forums many moons ago. The actual formula in the macro is ph_geturl, e.g., ph_geturl("https://www.wunderground.com/weather/us/ca/arcata/95521").

Edited by Handman - September 20 2020 at 22:53
 
dhoward
Admin Group
Joined: June 29 2001
Location: United States
Posts: 4335
Posted: September 21 2020 at 21:50

Handman,

Sorry to take so long responding. I had to dig through the sourcecode history to look at the relevant code that was in use for 2.1.5c.

I don't believe it's a problem with your socket control being registered since you've done that and reported success. Additionally, there are 5 other
commands to the socket control prior to the one for the remote port that don't have a problem.

My guess would be that it's a firewall issue since the error occurs during the setting of the remoteport property which is when a firewall rule would
typically be triggered. I assume you're using the built in Windows 10 firewall so you may want to add an exception for pwrhome.exe.

Another thing I would try is to launch pwrhome.exe with "Run as administrator".

Last, in your sample URL, you've got "https://www....". The ph_geturl function in version 2.1.5c did not support making calls to secure websites. If you
are actually trying to call an HTTPS site, then the function will ultimately fail. I don't believe that is what is causing the current issue, as you aren't
getting that far yet.

Hope this helps some. Since you're on version 2.1.5c, you may want to look into moving to 2.1.5e which is an easy transition (version 2.1.5c is the minimum
version you need to upgrade to 2.1.5e).

Dave.
 
Handman
Senior Member
Joined: February 02 2009
Location: United States
Posts: 205
Posted: September 21 2020 at 22:22

I don't think it was the firewall. There was an exception for pwrhome.exe and I do run it as administrator. I dropped the "S" and now it runs! Looks like that's all it was. Unfortunately the macro didn't scrape any useful meteorological data from the public WeatherUnderground site I used (the API system worked flawlessly until WU canned it). Do you use a better site?

I will upgrade to 2.1.5e. It might fix some latency issues I'm having. Strangely enough, while looking through the forums I started seeing topics that seemed right on point and discovered it was MY OWN ISSUE when I last upgraded to Win7!! I wish there had been a silver bullet, but it did resolve on its own then, so I am hoping for some of that karma again.

Thanks again for your help, Dave. I'll probably jump to 2.2 as soon as I get this new system (which I am running remotely) to be stable.

By the way, this is the formula evaluation of the ph_geturl command:



Formula Evaluation
     Execution time: 0.141 seconds.
     The formula evaluates to: HTTP/1.0 301 Moved Permanently
Server: AkamaiGHost
Content-Length: 0
Location: https://www.wunderground.com/weather/us/ca/arcata/40.89,-124.09
Cache-Control: max-age=0
Expires: Tue, 22 Sep 2020 03:22:06 GMT
Date: Tue, 22 Sep 2020 03:22:06 GMT
Connection: close
Set-Cookie: speedpin=4G; expires=Tue, 22-Sep-2020 03:36:51 GMT; path=/; domain=.wunderground.com; secure
Set-Cookie: ci=TWC-Connection-Speed=4G&TWC-Locale-Group=US&TWC-Device-Class=desktop&X-Origin-Hint=wu-next-prod&TWC-Network-Type=wifi&TWC-GeoIP-Country=US&TWC-GeoIP-Lat=36.5816&TWC-GeoIP-Long=-121.8436&Akamai-Connection-Speed=1000+&TWC-Privacy=usa-ccpa; path=/; domain=.wunderground.com; secure
Property-id: TWC-WU-Prod
TWC-Privacy: usa-ccpa
TWC-GeoIP-LatLong: 36.5816,-121.8436
TWC-GeoIP-Country: US
TWC-Device-Class: desktop
TWC-Locale-Group: US
TWC-Connection-Speed: 4G
X-Origin-Hint: wu-next-prod
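
The response above is a redirect, not a page. A small Python sketch can pull out the two fields that matter, the status code and the `Location` target. This is an illustration only: `parse_head` is a hypothetical helper, the header block is abbreviated from the output above, and repeated headers such as `Set-Cookie` are simply overwritten.

```python
# Abbreviated copy of the header block returned by the formula evaluation.
RAW = (
    "HTTP/1.0 301 Moved Permanently\r\n"
    "Server: AkamaiGHost\r\n"
    "Content-Length: 0\r\n"
    "Location: https://www.wunderground.com/weather/us/ca/arcata/40.89,-124.09\r\n"
    "Connection: close\r\n"
    "\r\n"
)


def parse_head(raw):
    """Split a raw HTTP response head into (status_code, headers_dict)."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    _version, code, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # later duplicates overwrite
    return int(code), headers


status, headers = parse_head(RAW)
redirect = headers.get("location") if status in (301, 302, 303, 307, 308) else None
needs_https = bool(redirect) and redirect.lower().startswith("https://")

print(status)       # 301
print(redirect)     # https://www.wunderground.com/weather/us/ca/arcata/40.89,-124.09
print(needs_https)  # True
```

Here the plain-http request is answered with a redirect to an https address, so a client without HTTPS support has nowhere to go from this response.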





Edited by Handman - September 21 2020 at 22:26
 
Handman
Senior Member
Joined: February 02 2009
Location: United States
Posts: 205
Posted: September 21 2020 at 22:44

After playing with different weather sites it seems that they are all SSL certificate/secure sites. As you noted, the function won't work. When I substitute http, the formula returns an error. Why are none of these sites non-certificate urls? I really hadn't noticed it before, but my search engines don't ever bring up a non-https site at all!
UPDATE: After upgrading to PH 2.1.5e, the function is now apparently working with secure websites, but unfortunately the scraper isn't extracting useful data for the weather macro to update global variables in PH. I guess that leaves me in good company with other users on this forum.   

Edited by Handman - September 22 2020 at 11:34
 
dhoward
Admin Group
Joined: June 29 2001
Location: United States
Posts: 4335
Posted: September 26 2020 at 22:10

Handman,

I couldn't understand how that could happen, so I finally got a chance to really go over the 2.1.5c code in detail. I see the problem now. Since that version of
PowerHome didn't support HTTPS, I never checked for https:// at the beginning of the URL to remove it (I only checked for http://). Since I didn't remove it, the
next step in parsing the URL was to search for a ":", which would be preceding a port number. In this case, the program would be using the : in https and the
data immediately following, which ultimately would result in a port number of NULL. Makes perfect sense now.
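
The parsing slip described above can be illustrated with a short Python sketch. This is not the PowerHome source, only a hypothetical reconstruction of the same logic: `naive_host_port` strips `http://` but not `https://`, then treats whatever follows the first ":" as the port. `host_port` is an assumed alternative built on the standard library's `urlsplit`.

```python
from urllib.parse import urlsplit


def naive_host_port(url):
    """Mimic the described logic: strip only 'http://', then split on the first ':'."""
    rest = url[len("http://"):] if url.lower().startswith("http://") else url
    host, sep, after = rest.partition(":")
    if not sep:
        return host.split("/", 1)[0], 80      # no ':' found, use the default port
    digits = after.split("/", 1)[0]           # text between ':' and the next '/'
    port = int(digits) if digits.isdigit() else None
    return host, port


def host_port(url):
    """Scheme-aware alternative using the standard library."""
    parts = urlsplit(url)
    default = {"http": 80, "https": 443}.get(parts.scheme)
    return parts.hostname, parts.port or default


secure = "https://www.wunderground.com/weather/us/ca/arcata/95521"
print(naive_host_port("http://www.example.com:8080/path"))  # ('www.example.com', 8080)
print(naive_host_port(secure))                              # ('https', None)
print(host_port(secure))                                    # ('www.wunderground.com', 443)
```

With the https URL, the first ":" is the one after the scheme, so the "port" text is empty and the result is a null port, consistent with the remoteport error reported earlier in the thread.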

Anyways, glad to see you upgraded to 2.1.5e. It definitely has the support for HTTPS but as you've found, hard to get good weather data. I need to rework my own
weather scraper so once I find a good site and come up with the regular expressions to reliably extract, I'll post the info here for all to use.

Dave.