ph_regexsnap

Use this function for powerful text search capabilities. The regular expression special characters supported are:

.	Matches any character.
\(	This marks the start of a region for tagging a match.
\)	This marks the end of a tagged region.
\<	This matches the start of a word.
\>	This matches the end of a word.
\x	This allows you to use a character x that would otherwise have a special meaning. For example, \[ would be interpreted as [ and not as the start of a character set.
[...]	This indicates a set of characters, for example, [abc] means any of the characters a, b or c. You can also use ranges, for example [a-z] for any lower case character.
[^...]	The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character.
^	This matches the start of a line (unless used inside a set, see above).
$	This matches the end of a line.
*	This matches 0 or more times. For example, Sa*m matches Sm, Sam, Saam, Saaam and so on.
+	This matches 1 or more times. For example, Sa+m matches Sam, Saam, Saaam and so on.

An important note on the * and + special characters is that they perform a "greedy" regular expression search. When you use them, this function will not stop at the first possible match but will instead extend to the last possible match.
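The greedy behaviour can be illustrated outside PowerHome. The sketch below uses Python's re module as a stand-in, which is an assumption: it is not the PowerHome engine, and it writes tagged regions as ( ) rather than \( \). It contrasts a greedy .+ with a [^...] set that stops at the first comma. The sample line is made up for illustration.

```python
import re

line = '"Refrig",39.20,37.46,'

# Greedy: .+ keeps going and only gives back enough to end on the LAST comma.
greedy = re.search(r',(.+,)', line).group(1)

# Bounded: a complemented set cannot run past a comma, so it stops at the first one.
bounded = re.search(r',([^,]+),', line).group(1)

print(greedy)   # 39.20,37.46,
print(bounded)  # 39.20
```

Since [^...] is listed among the supported special characters, the same bounding idea should carry over to ph_regexsnap patterns, although that has not been verified against PowerHome itself.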

NOTE: If a string contains any quote characters ("), then the string must be delimited with the single quote character ('). For example: 'he said, "no"'

This function will also not perform a regular expression search that spans multiple lines. If the data to search contains carriage returns or line feeds, the entire text matched by the regular expression must exist within a single line. If your regular expression must span lines, add 2 to the flags to have CRs and LFs temporarily converted: CR is converted to ASCII 128 and LF to ASCII 129. If you convert CR/LF, you can include them in your search with the PowerHome escape characters ~128 and ~129 respectively. NOTE, however, that replacing CR/LFs slows searches significantly because, before the search can even start, the entire string must be examined character by character and the replacements made as needed. It is thus preferable, where possible, to use the escape character (~255) rather than (.+) with CR/LF replacement.

The ~255 approach allows you to perform multiple searches within your search pattern (even across multiple lines). Only the last of the multiple searches can be used for returning "snapped" data. NOTE: the ~255 character is NOT part of the regular expression syntax; it is strictly a PH special creation to separate regex searches. The \( and \) regex characters are a pair that MUST go together, so a non-regex separator such as ~255 should never be inserted between them.

The ph_regexsnap function is similar to the ph_regexdiff function in that it uses regular expressions to search and return data within the searched string. Instead of returning data between two regular expressions (like ph_regexdiff), this function uses \( and \) to mark tagged regions. You may have up to 9 tagged regions in the last search pattern. All matched text within the tagged regions will be "snapped out" and returned by this function.

This function is often used with other PH string functions to trim a larger string, or to locate a string position within another string. See also pos(), posw(), ph_pos(), left(), mid(), right().

See also the .FAQs-String Tips-Hints Help file.

The following examples assume that the following string (with CR/LF line endings) is stored in [LOCAL1]...

ROMId,Name, Value,Avg,
"3F000001CD92C728","Refrig",39.20,37.46,
"3F6000017C8BD128","Outside",23.90,19.81,
"3F000001CDB2BA27","House",70.65,70.13,
"3F000001CD9E6D28","Freezer",1.96,1.11,

The following command ...
ph_regexsnap('ROM~255,\(.+,\)', '[LOCAL1]', 1, 0 ) --> finds Name, Value,Avg,
because the ".+," search is greedy and doesn't quit until it finds the last ","

While ...
ph_regexsnap('ROM~255,\(.+,\)', '[LOCAL1]', 1, 2 ) --> finds everything from Name to .... 6D28","Freezer",1.96,1.11,
because the ignore CR/LF flag is set, so the extraction keeps going across multiple lines until it finds the last comma.

Because of this "greediness," regexsnap is not easily used in some situations, and the regexdiff family is more useful. However, when you are assured of finding unique strings to bound the search, regexsnap can be very powerful while remaining easy to use.

This ...
ph_regexsnap('ROM~255,\(....\)', '[LOCAL1]', 1, 0 ) --> finds Name
since the "...." accepts any 4 characters after the comma.
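The three results above can be reproduced with a rough emulation. The sketch below uses Python's re module, not PowerHome, so several things are assumptions: tagged regions are written ( ) instead of \( \), the search before ~255 is approximated by locating "ROM" first and searching onward from there, and re.DOTALL plays the role of flag value 2 by letting . cross line endings. The helper name snap is hypothetical.

```python
import re

# The sample string from above, with CR/LF line endings.
data = "\r\n".join([
    'ROMId,Name, Value,Avg,',
    '"3F000001CD92C728","Refrig",39.20,37.46,',
    '"3F6000017C8BD128","Outside",23.90,19.81,',
    '"3F000001CDB2BA27","House",70.65,70.13,',
    '"3F000001CD9E6D28","Freezer",1.96,1.11,',
]) + "\r\n"

# Stand-in for the search before ~255: find "ROM", then continue from there.
start = data.find("ROM") + len("ROM")

def snap(pattern, flags=0):
    """Return the first tagged region of pattern, searching after "ROM"."""
    match = re.compile(pattern, flags).search(data, start)
    return match.group(1) if match else ""

same_line = snap(r",(.+,)")                # comparable to flags = 0
across_lines = snap(r",(.+,)", re.DOTALL)  # comparable to flags = 2
four_chars = snap(r",(....)")

print(same_line)   # Name, Value,Avg,
print(four_chars)  # Name
```

With re.DOTALL the snapped text runs from Name on the first line through the final comma after 1.11 on the last line, which mirrors the flags = 2 result described above.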

A good example to explain the use of ~255 is extracting weather data.

Consider the following data which may have been returned by a ph_geturl() function (the data is simplified of course):

<html>
<body>
<h4>Temperature</h4>
<b>78.5</b>
<h4>Humidity</h4>
<b>55.7</b>
<h4>Wind Speed</h4>
<b>7.5</b>
</body>
</html>

Without the ~255, this data would be difficult to parse (for example, to retrieve the value for humidity) unless you substituted the CR/LF (which we never want to do if we can help it). Since a regex can only be performed on a single line, the best we could do is search for the second occurrence of the <b></b> tags. That works for this data, but it would fail if a new parameter were added ahead of humidity.

Using the ~255 we can do a regexsnap to retrieve humidity like this...

ph_regexsnap("Humidity~255<b>\(.+\)</b>",'[LOCAL1]',1,0) {where the ~255 causes the characters between "Humidity" and the next <b> to be skipped over, and the characters between <b> and </b> to be extracted.}

The statement above will always return the value between the first <b></b> tags following the word Humidity. Of course, we have to remember the regex is greedy so we need to be careful with the ".+". If there are multiple <b></b> tags on the same line, we would get more than we expected. Since we know the humidity value will only be numbers and a decimal, we could error proof the above a little more with this:

ph_regexsnap("Humidity~255<b>\([0-9\.]*\)</b>",'[LOCAL1]',1,0)
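
To make the role of ~255 concrete, here is a minimal sketch of how the chained searches might behave, written in Python rather than PowerHome. It rests on assumptions: each ~255-separated search is taken to start where the previous match ended, tagged regions are written ( ) instead of \( \), multiple tagged regions are simply joined together, and an empty string is returned when nothing matches. The function name snap_chain is hypothetical and is not part of PowerHome.

```python
import re

SEP = "~255"  # PowerHome's separator between chained searches

def snap_chain(pattern, data):
    """Run each ~255-separated search in turn, each one starting where the
    previous match ended, and return the tagged text of the last search."""
    pos = 0
    match = None
    for part in pattern.split(SEP):
        match = re.compile(part).search(data, pos)
        if match is None:
            return ""  # assumption: no match yields an empty result
        pos = match.end()
    return "".join(match.groups())

# The simplified weather page from above, with CR/LF line endings.
page = "\r\n".join([
    "<html>", "<body>",
    "<h4>Temperature</h4>", "<b>78.5</b>",
    "<h4>Humidity</h4>", "<b>55.7</b>",
    "<h4>Wind Speed</h4>", "<b>7.5</b>",
    "</body>", "</html>",
])

humidity = snap_chain(r"Humidity~255<b>([0-9.]*)</b>", page)
wind = snap_chain(r"Wind Speed~255<b>([0-9.]*)</b>", page)

print(humidity)  # 55.7
print(wind)      # 7.5
```

Each individual search still matches within a single line; only the hand-off between searches crosses lines, which is why no CR/LF conversion is needed in this sketch.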

Argument	Description
pattern	String. A regular expression search pattern containing up to 9 tagged regions that you want snapped out and returned.
data	String. The string in which to perform the search and snap.
start	Long. The starting position of the search. Use 1 to start at the beginning.
flags	Integer. Flags that control how the search is performed. Add individual flag values together. Add 1 to cause the search to match case. Add 2 to cause the search to ignore CR/LFs within the data.