For the Codea showcase, I have to decode a little bit of HTML, so I've decided to switch to patterns. Here are my first patterns, explained. Since I am a noob with patterns myself, this may help some of you get into them more quickly.
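function setup()
    -- split on a space, then on a pipe, and print each piece
    local str = "12 add FGv"
    for s in charSplit(str, " ") do print(s) end
    str = "12|add|FGv"
    for s in charSplit(str, "|") do print(s) end
    somePatterns()
end

-- Returns an iterator over the pieces of str separated by c.
-- c is inserted into the pattern as-is, so a magic character such as '.' must be passed escaped ('%.').
function charSplit(str, c)
    str = str .. c -- append a separator so the last piece is matched too
    return string.gmatch(str, "(.-)" .. c .. "+")
end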
function somePatterns()
    -- some patterns explained:
    local pattern = '(%b<>)' -- finds all <html tags>
    -- pattern explained step by step:
    -- '%b<>'   a substring that starts with '<' and ends with the balancing '>'
    -- '(%b<>)' capture that substring, delimiters included
    local data = "<xxx> yyy </xxx>"
    print("pattern: '" .. pattern .. "'")
    print("input: '" .. data .. "'")
    print("result:")
    for tag in string.gmatch(data, pattern) do
        print("'" .. tag .. "'") -- '<xxx>', then '</xxx>'
    end
    pattern = '<xxx>(.-)</xxx>' -- finds the content of one html tag xxx
    -- pattern explained step by step:
    -- '<xxx>.</xxx>'    exactly one character between '<xxx>' and '</xxx>'
    -- '<xxx>(.)</xxx>'  capture that single character, delimiters excluded
    -- '<xxx>(.-)</xxx>' capture the shortest possible run of characters between '<xxx>' and '</xxx>', delimiters excluded
    data = "<xxx> yyy </xxx>"
    print("pattern: '" .. pattern .. "'")
    print("input: '" .. data .. "'")
    print("result:")
    local tag = string.match(data, pattern) -- nil if the pattern does not match
    print("'" .. tostring(tag) .. "'") -- ' yyy '
    pattern = '(<.->)([^<]*)' -- returns all pairs: <html tag>, text that follows the tag
    -- pattern explained step by step:
    -- '<.>'     exactly one character between '<' and '>'
    -- '<.->'    the shortest possible substring delimited by '<' and '>'
    -- '(<.->)'  capture that substring, delimiters included
    -- '[^<]'    one character that is not '<'
    -- '[^<]*'   the longest possible run of characters that are not '<' (0 or more)
    -- '([^<]*)' capture that run
    -- let's call these 2 sub-patterns [A] = '(<.->)' and [B] = '([^<]*)'. The whole pattern is:
    -- '(<.->)([^<]*)' capture a substring according to [A], then, starting right after the last char of [A], capture a substring according to [B]
    data = "<xxx> yyy </xxx>"
    print("pattern: '" .. pattern .. "'")
    print("input: '" .. data .. "'")
    print("result:")
    for tag, inner in string.gmatch(data, pattern) do
        print("'" .. tag .. "', '" .. inner .. "'") -- '<xxx>', ' yyy ' then '</xxx>', ''
    end
end
--[[
-- I found this nice summary on the web:
http://www.gammon.com.au/scripts/doc.php?lua=string.find
Patterns
The standard patterns (character classes) you can search for are:
. --- (a dot) represents all characters.
%a --- all letters.
%c --- all control characters.
%d --- all digits.
%l --- all lowercase letters.
%p --- all punctuation characters.
%s --- all space characters.
%u --- all uppercase letters.
%w --- all alphanumeric characters.
%x --- all hexadecimal digits.
%z --- the character with hex representation 0x00 (null).
%% --- a single '%' character.
%1 --- captured pattern 1.
%2 --- captured pattern 2 (and so on).
%f[s] transition from not in set 's' to in set 's'.
%b() balanced pair ( ... )
Important! - the uppercase versions of the above represent the complement of the class. eg. %U represents everything except uppercase letters, %D represents everything except digits.
Also important! If you are using string.find (or string.match etc.) in MUSHclient, and inside "send to script" in a trigger or alias, then the % sign has special meaning there (it is used to identify wildcards, such as %1 is wildcard 1). Thus the % signs in string.find need to be doubled or they won't work properly (so use %%d instead of %d in "send to script").
There are some "magic characters" (such as %) that have special meanings. These are:
^ $ ( ) % . [ ] * + - ?
If you want to use those in a pattern (as themselves) you must precede them by a % symbol.
eg. %% would match a single %
You can build your own pattern classes (sets) by using square brackets, eg.
[abc] ---> matches a, b or c
[a-z] ---> matches lowercase letters (same as %l)
[^abc] ---> matches anything except a, b or c
[%a%d] ---> matches all letters and digits
[%a%d_] ---> matches all letters, digits and underscore
[%[%]] ---> matches square brackets (had to escape them with %)
--[[
You can use pattern classes in the form %x in the set. If you use other characters (like periods and brackets, etc.) they are simply themselves.
You can specify a range of characters inside a set by using simple characters (not pattern classes like %a) separated by a hyphen. For example, [A-Z] or [0-9]. These can be combined with other things. For example [A-Z0-9] or [A-Z,.].
The end-points of a range must be given in ascending order. That is, [A-Z] would match upper-case letters, but [Z-A] would not match anything.
You can negate a set by starting it with a "^" symbol, thus [^0-9] is everything except the digits 0 to 9. The negation applies to the whole set, so [^%a%d] would match anything except letters or digits. Anywhere except the first position of a set, the "^" symbol is simply itself.
Inside a set (that is a sequence delimited by square brackets) the only "magic" characters are:
] ---> to end the set, unless preceded by %
% ---> to introduce a character class (like %a), or magic character (like "]")
^ ---> in the first position only, to negate the set (eg. [^A-Z])
- ---> between two characters, to specify a range (eg. [A-F])
Thus, inside a set, characters like "." and "?" are just themselves.
The repetition characters, which can follow a character, class or set, are:
+ ---> 1 or more repetitions (greedy)
* ---> 0 or more repetitions (greedy)
- ---> 0 or more repetitions (non greedy)
? ---> 0 or 1 repetition only
A "greedy" match will match on as many characters as possible, a non-greedy one will match on as few as possible.
The standard "anchor" characters apply:
^ ---> anchor to start of subject string (must be the very first character)
$ ---> anchor to end of subject string
You can also use round brackets to specify "captures":
You see (.*) here
Here, whatever matches (.*) becomes the first capture.
You can also refer to matched substrings (captures) later on in an expression:
print (string.find ("You see dogs and dogs", "You see (.*) and %1")) --> 1 21 dogs
print (string.find ("You see dogs and cats", "You see (.*) and %1")) --> nil
This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).
As a special case, an empty capture, written (), returns its own position in the string as the captured value. eg.
print (string.find ("You see dogs and cats", "You .* ()dogs .*")) --> 1 21 9
What this is saying is that the word "dogs" starts at column 9.
Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:
print (string.find ("I see a (big fish (swimming) in the pond) here",
"%b()")) --> 9 41
After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".
Examples of string.find:
print (string.find ("the quick brown fox", "quick")) --> 5 9
print (string.find ("the quick brown fox", "(%a+)")) --> 1 3 the
print (string.find ("the quick brown fox", "(%a+)", 10)) --> 11 15 brown
print (string.find ("the quick brown fox", "fruit")) --> nil
See Also ...
Lua functions
string.byte - Converts a character into its ASCII (decimal) equivalent
string.char - Converts ASCII codes into their equivalent characters
string.dump - Converts a function into binary
string.format - Formats a string
string.gfind - Iterate over a string (obsolete in Lua 5.1)
string.gmatch - Iterate over a string
string.gsub - Substitute strings inside another string
string.len - Return the length of a string
string.lower - Converts a string to lower-case
string.match - Searches a string for a pattern
string.rep - Returns repeated copies of a string
string.reverse - Reverses the order of characters in a string
string.sub - Returns a substring of a string
string.upper - Converts a string to upper-case
--]]
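One thing to watch with charSplit above: the separator is pasted straight into the pattern, so a magic character such as "." is read as "any character" and the string is not split on periods. Here is a minimal sketch of a variant that escapes the separator first. The names escapePattern and literalSplit are my own, hypothetical helpers, not part of Lua or Codea, and unlike charSplit this version returns a table and keeps empty fields instead of collapsing repeated separators.

```lua
-- Hypothetical helper: put a '%' in front of every non-alphanumeric
-- character, so the string matches itself literally inside a pattern.
local function escapePattern(s)
    -- "%W" is any non-alphanumeric char; "%%%0" is a literal '%' plus the match
    return (s:gsub("%W", "%%%0"))
end

-- Hypothetical variant of charSplit: splits str on the literal separator sep
-- and returns the pieces in a table (empty fields are kept).
local function literalSplit(str, sep)
    local parts = {}
    local esc = escapePattern(sep)
    -- append one separator so the last piece is matched too
    for piece in (str .. sep):gmatch("(.-)" .. esc) do
        parts[#parts + 1] = piece
    end
    return parts
end

for _, piece in ipairs(literalSplit("12.add.FGv", ".")) do
    print(piece)
end
```

Escaping every non-alphanumeric character is more than strictly needed, but it is safe, because in a pattern "%" followed by any non-alphanumeric character stands for that character itself.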