Patterns tutorial

Jmv38 · September 3, 2014, 1:05pm

For the Codea showcase, I have to decode a little bit of HTML, and I’ve decided to use Lua patterns for it. Here are my first patterns, explained. I am a beginner with patterns myself, so this may help some of you get into them more quickly.


function setup()

    str = "12 add  FGv"
    for s in charSplit(str," ") do print(s) end
    
    str = "12|add|FGv"
    for s in charSplit(str,"|") do print(s) end

    somePatterns()
end

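-- returns an iterator over the pieces of str separated by one or more occurrences of the character c
function charSplit(str,c)
    str = str .. c -- make sure the last piece is also followed by a separator
    c = string.gsub(c,"%p","%%%0") -- escape c in case it is a magic character such as '.'
    return string.gmatch(str,"(.-)"..c.."+")
end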

function somePatterns()

-- some patterns explained:
    
    local pattern = '(%b<>)' -- finds all <html tags>
-- pattern explained step by step:
--  '%b<>'  a substring delimited by '<' and '>'
-- '(%b<>)' capture a substring delimited by '<' and '>', and include delimiters
    local data = "<xxx> yyy </xxx>"
    print("pattern: '"..pattern .. "'" )
    print("input: '" .. data .. "'")
    print("result:")
    for tag in string.gmatch(data, pattern) do
        print("'" .. tag .. "'" )
    end
    
    local pattern = '<xxx>(.-)</xxx>' -- finds content of one html tag xxx
--  pattern explained step by step:
-- '<xxx>.</xxx>'    exactly one character (any character) between '<xxx>' and '</xxx>'
-- '<xxx>(.)</xxx>'  same, and capture that one character (the delimiters are not captured)
-- '<xxx>(.-)</xxx>' capture the shortest possible run of characters (0 or more) between '<xxx>' and '</xxx>', delimiters excluded
    local data = "<xxx> yyy </xxx>"
    print("pattern: '"..pattern .. "'" )
    print("input: '" .. data .. "'")
    print("result:")
    local tag = string.match(data, pattern) 
    print("'" .. tag .. "'" )
    
    local pattern = '(<.->)([^<]*)' -- returns all pairs: <html tags>, text between html tags 
-- pattern explained step by step:
--  '<.>'   exactly one character (any character) between '<' and '>'
--  '<.->'  the shortest possible run of characters (0 or more) between '<' and '>'
-- '(<.->)' capture that substring, delimiters included
--        '[^<]'   one character that is not '<'
--        '[^<]*'  the longest possible run of characters that are not '<', minimum 0 char.
--       '([^<]*)' capture that run
-- let's call these 2 patterns [A] and [B]. The whole pattern is:
-- '(<.->)([^<]*)' capture a substring matching [A], then, starting right after the last char matched by [A], capture a substring matching [B]
    local data = "<xxx> yyy </xxx>"
    print("pattern: '"..pattern .. "'" )
    print("input: '" .. data .. "'")
    print("result:")
    for tag, inner in string.gmatch(data,pattern) do
        print("'" .. tag .. "', '" .. inner .. "'" )
    end
end

--[[

-- I found this nice summary on the web:
http://www.gammon.com.au/scripts/doc.php?lua=string.find

Patterns

The standard patterns (character classes) you can search for are:


 . --- (a dot) represents all characters. 
%a --- all letters. 
%c --- all control characters. 
%d --- all digits. 
%l --- all lowercase letters. 
%p --- all punctuation characters. 
%s --- all space characters. 
%u --- all uppercase letters. 
%w --- all alphanumeric characters. 
%x --- all hexadecimal digits. 
%z --- the character with hex representation 0x00 (null). 
%% --- a single '%' character.
%1 --- captured pattern 1.
%2 --- captured pattern 2 (and so on).
%f[s] --- transition from not in set 's' to in set 's'.
%b() --- balanced pair ( ... ).
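
As a small illustration of the less obvious %f entry, the sketch below uses a frontier on each side of a word so that only whole words are counted. The helper name countWord and the sample sentences are made up for this example, and the word is assumed to contain no magic characters. Note that %f is only documented from Lua 5.2 on, so check that your Lua version supports it.

```lua
-- countWord is a hypothetical helper, not part of the Lua library.
-- It counts how many times 'word' appears as a whole word in 'text'.
local function countWord(text, word)
    -- %f[%w] : a position where a run of alphanumeric characters starts
    -- %f[%W] : a position where a run of alphanumeric characters ends
    local n = 0
    for _ in string.gmatch(text, "%f[%w]" .. word .. "%f[%W]") do
        n = n + 1
    end
    return n
end

-- "scattered" and "catalog" contain "cat" but are not counted
print(countWord("the cat scattered the catalog", "cat")) --> 1
```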


Important! - the uppercase versions of the above represent the complement of the class. eg. %U represents everything except uppercase letters, %D represents everything except digits.
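
A quick way to see the complement classes at work is to delete or collect with them. This is only a sketch: digitsOnly is a made-up helper name and the phone-like sample string is invented for the illustration.

```lua
-- %D is "everything except digits", so removing every %D leaves the digits.
-- digitsOnly is a hypothetical helper used only for this example.
local function digitsOnly(s)
    -- the extra parentheses drop gsub's second result (the replacement count)
    return (string.gsub(s, "%D", ""))
end

print(digitsOnly("tel: 555-0134"))  --> 5550134

-- %S+ is a run of non-space characters, so this prints 12, add, FGv
for word in string.gmatch("12 add  FGv", "%S+") do
    print(word)
end
```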

Also important! If you are using string.find (or string.match etc.) in MUSHclient, and inside "send to script" in a trigger or alias, then the % sign has special meaning there (it is used to identify wildcards, such as %1 is wildcard 1). Thus the % signs in string.find need to be doubled or they won't work properly (so use %%d instead of %d in "send to script").


There are some "magic characters" (such as %) that have special meanings. These are:


^ $ ( ) % . [ ] * + - ? 


If you want to use those in a pattern (as themselves) you must precede them by a % symbol.

eg. %% would match a single %
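
When the text to search for comes from somewhere else, the escaping can be automated. The sketch below shows one common way to do it, not the only one; escapePattern is a made-up helper name and the sample strings are assumptions.

```lua
-- Put a % in front of every punctuation character so the string
-- is matched literally. escapePattern is a hypothetical helper.
local function escapePattern(s)
    return (string.gsub(s, "%p", "%%%0"))
end

print(escapePattern("3.5%"))                                  --> 3%.5%%
print(string.find("price: 3.5% off", escapePattern("3.5%")))  --> 8 11

-- For a plain substring search, string.find can also skip patterns
-- entirely: passing true as the 4th argument (plain) turns magic off.
print(string.find("price: 3.5% off", "3.5%", 1, true))        --> 8 11
```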

You can build your own pattern classes (sets) by using square brackets, eg.


[abc] ---> matches a, b or c
[a-z] ---> matches lowercase letters (same as %l)
[^abc] ---> matches anything except a, b or c
[%a%d] ---> matches all letters and digits
[%a%d_] ---> matches all letters, digits and underscore
[%[%]] ---> matches square brackets (had to escape them with %)

--[[
You can use pattern classes in the form %x in the set. If you use other characters (like periods and brackets, etc.) they are simply themselves.

You can specify a range of characters inside a set by using simple characters (not pattern classes like %a) separated by a hyphen. For example, [A-Z] or [0-9]. These can be combined with other things. For example [A-Z0-9] or [A-Z,.].

The end-points of a range must be given in ascending order. That is, [A-Z] would match upper-case letters, but [Z-A] would not match anything.

You can negate a set by starting it with a "^" symbol, thus [^0-9] is everything except the digits 0 to 9. The negation applies to the whole set, so [^%a%d] would match anything except letters or digits. Anywhere except the first position of a set, the "^" symbol is simply itself.

Inside a set (that is a sequence delimited by square brackets) the only "magic" characters are:


] ---> to end the set, unless preceded by %
% ---> to introduce a character class (like %a), or magic character (like "]")
^ ---> in the first position only, to negate the set (eg. [^A-Z])
- ---> between two characters, to specify a range (eg. [A-F])


Thus, inside a set, characters like "." and "?" are just themselves.
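
The set rules above can be tried out with a few one-liners. This is only a sketch; the sample strings are invented for the illustration.

```lua
-- a range combined with another range: runs of digits and A to F
print(string.match("color = #FF8800;", "[0-9A-F]+"))   --> FF8800
-- a negated set: everything up to (not including) the first comma
print(string.match("first,second", "[^,]*"))           --> first
-- inside a set, '.' and '?' are ordinary characters
print(string.match("really?!", "[.?!]+"))              --> ?!
-- '^' is only special in the first position of the set
print(string.match("2^10", "[0-9^]+"))                 --> 2^10
```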

The repetition characters, which can follow a character, class or set, are:


+  ---> 1 or more repetitions (greedy)
*  ---> 0 or more repetitions (greedy)
-  ---> 0 or more repetitions (non greedy)
?  ---> 0 or 1 repetition only


A "greedy" match will match on as many characters as possible, a non-greedy one will match on as few as possible.
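
The difference is easiest to see on the tag example from the start of this post. A minimal sketch, using the same sample data:

```lua
local data = "<xxx> yyy </xxx>"
-- greedy: '.*' runs to the LAST '>' it can reach
print(string.match(data, "<.*>"))   --> <xxx> yyy </xxx>
-- non-greedy: '.-' stops at the FIRST '>' it can reach
print(string.match(data, "<.->"))   --> <xxx>
-- '?' makes the preceding character optional
print(string.match("color", "colou?r"))   --> color
print(string.match("colour", "colou?r"))  --> colour
```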

The standard "anchor" characters apply:


^  ---> anchor to start of subject string (must be the very first character)
$  ---> anchor to end of subject string
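
A short sketch of the two anchors, with sample strings invented for the illustration:

```lua
-- '^' at the start of the pattern: match only at the beginning
print(string.match("12 add", "^%d+"))    --> 12
print(string.match("add 12", "^%d+"))    --> nil
-- '$' at the end of the pattern: match only at the end
print(string.match("add 12", "%d+$"))    --> 12
-- both together: the whole string must fit the pattern
print(string.match("1234", "^%d+$"))     --> 1234
print(string.match("12a4", "^%d+$"))     --> nil
```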


You can also use round brackets to specify "captures":


You see (.*) here


Here, whatever matches (.*) becomes the first capture.

You can also refer to matched substrings (captures) later on in an expression:


print (string.find ("You see dogs and dogs", "You see (.*) and %1")) --> 1 21 dogs
print (string.find ("You see dogs and cats", "You see (.*) and %1")) --> nil


This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).
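
Captures can also be referred to in the replacement string of string.gsub, which is a different place from the pattern itself. A small sketch, with sample strings made up for the illustration:

```lua
-- in the replacement string, %1 and %2 stand for the captures
local swapped = string.gsub("hello world", "(%w+) (%w+)", "%2 %1")
print(swapped)  --> world hello
-- %0 stands for the whole match
local quoted = string.gsub("a bc", "%w+", "<%0>")
print(quoted)   --> <a> <bc>
```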

As a special case, an empty capture, written (), captures the current position in the string (a number) instead of a substring. eg.


print (string.find ("You see dogs and cats", "You .* ()dogs .*")) --> 1 21 9


What this is saying is that the word "dogs" starts at column 9.

Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:


print (string.find ("I see a (big fish (swimming) in the pond) here",
       "%b()"))  --> 9 41


After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".



Examples of string.find:


print (string.find ("the quick brown fox", "quick")) --> 5 9
print (string.find ("the quick brown fox", "(%a+)")) --> 1 3 the
print (string.find ("the quick brown fox", "(%a+)", 10)) --> 11 15 brown
print (string.find ("the quick brown fox", "fruit")) --> nil

See also ...

Lua functions

 string.byte - Converts a character into its ASCII (decimal) equivalent 
 string.char - Converts ASCII codes into their equivalent characters 
 string.dump - Converts a function into binary 
 string.format - Formats a string 
 string.gfind - Iterate over a string (obsolete in Lua 5.1) 
 string.gmatch - Iterate over a string 
 string.gsub - Substitute strings inside another string 
 string.len - Return the length of a string 
 string.lower - Converts a string to lower-case 
 string.match - Searches a string for a pattern 
 string.rep - Returns repeated copies of a string 
 string.reverse - Reverses the order of characters in a string 
 string.sub - Returns a substring of a string 
 string.upper - Converts a string to upper-case 

--]]
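
To tie the reference back to the HTML use case at the top of this post, here is a small sketch that combines string.gsub with the %b<> pattern to drop the tags and keep only the text. stripTags is a made-up helper name, and the approach assumes simple tags with no '<' or '>' inside attribute values; it is not a real HTML parser.

```lua
-- stripTags is a hypothetical helper: remove every <...> group from s.
local function stripTags(s)
    local text = string.gsub(s, "%b<>", "")
    -- trim the leading and trailing whitespace left behind by the tags
    return (string.gsub(text, "^%s*(.-)%s*$", "%1"))
end

print("'" .. stripTags("<xxx> yyy </xxx>") .. "'")  --> 'yyy'
```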

dave1707 · September 3, 2014, 3:11pm

@Jmv38 I think @SkyTheCoder also did something on patterns. I didn’t look for it, but I’m sure he will find it and comment.

LoopSpace · September 3, 2014, 3:18pm

I know that patterns are not full regexp, but I can’t resist linking to this answer.

There are XML parsers in lua and I did get one working in Codea some time ago. I would recommend using one.

Jmv38 · September 3, 2014, 3:38pm

@LoopSpace lol!
I’m not trying to parse HTML, though, just to extract a couple of pieces of data…

firewolf · July 11, 2015, 8:19am

Brilliant, I learned a bit.

firewolf · July 15, 2015, 6:59am

Like this: from http://www.smule.com/songbook/scores/4817 , I want to get tab={1,2,3} from its HTML “note small” “data note id”.