Alice: Time to open the black box that contains the helper class that
does all the work behind the scenes, for each individual option. Can you
show me the whole class definition, so that I get an idea of what it looks
like, before we inspect each method?
Bob: Here you are:
Alice: That's not as long as I thought it would be.
Bob: One of the great things about Ruby: because the notation is so compact,
and because you don't have to worry about types and declarations and all
that sort of stuff, you can write quite powerful codes in just a few pages.
Alice: Let's step through the Clop_option class. The initializer
starts off just like it did on the higher Clop class level. In
that case, the first line was:
while here we have only one line, the same apart from the final "s":
And that makes sense, since by definition this helper class takes care
of only one option at a time.
Bob: The next method shows how the parsing gets started:
Remember how we wrote the definition of an option block: first we write
one line for each piece of information, such as
This means that the parsing process is somewhat different for the
single-line instructions and for the last multi-line block. The method
parse_single_lines_done? takes care of the single lines, while
the last few lines of the parse_option_definition method take
care of the multi-line block.
Of course, I could have written a separate parse_multiple_lines
method, but that seemed to be a bit of overkill, given that the work can
be specified in just a few lines:
You just keep taking lines off from the def_str that contained
the whole here document, and when you encounter two successive blank
lines, you stop. Remember, we had agreed that two blank lines would
signal the start of a new option block.
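As a rough sketch of that loop (not the actual listing), assume the here document has already been split into an array of lines. The names take_long_description and blank_line? are hypothetical and chosen only for illustration:

```ruby
# Hypothetical predicate: true for a line that is empty or only white space.
def blank_line?(s)
  !s.nil? && s.strip.empty?
end

# Hypothetical helper: take lines off the front of the array until two
# successive blank lines are found, and return what was collected.
def take_long_description(lines)
  description = ""
  while (s = lines.shift)
    # Stop when this line and the next one are both blank.
    break if blank_line?(s) && blank_line?(lines.first)
    description += s + "\n"
  end
  description
end

lines = ["first line", "", "second line", "", "", "Short name: -n"]
text = take_long_description(lines)
```

A single blank line stays part of the description. Only the pair of blank lines ends it, and whatever follows remains in the array for the next option block.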
Alice: So all the rest of the parsing is done in the method
parse_single_lines_done?. Why the question mark at the end
of the name?
Bob: This is a nice feature of Ruby, that it allows you to add a
question mark or exclamation mark at the end of the name. You can't
use it as a general character in the middle of a name; it can only
appear at the end. Its use is to communicate to the human reader
something of the intention of the program: in this case, you might
guess that a boolean value is being returned by this method. If the
value is true, then indeed we are done parsing the single lines. If
the value is false, we aren't done yet.
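To illustrate the naming convention only, here is a small sketch of a predicate method driving a loop. The method name description_header? and its stopping rule (a line that names the long description) are assumptions made for this example, not the real logic of parse_single_lines_done?:

```ruby
# Hypothetical predicate: the trailing ? signals a true/false answer.
def description_header?(line)
  line.match?(/^\s*Long\s+description\s*:/i)
end

lines = ["Short name: -o", "Value type: string", "Long description:", "free text"]
seen = []
while (s = lines.shift)
  # Reads almost like a sentence: stop if this is the description header.
  break if description_header?(s)
  seen << s
end
```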
Alice: I like that, that does make the intention clearer.
Bob: Here is the method:
Alice: I see that you start off with another exercise in regular
expressions, but this one puzzles me:
Why do you add a ? after an * ? That seems to be
redundant. The * tells you to expect zero or more instances
of the previous character, while the ? tells you to expect
just zero or one instances. No, I take that back, it is not even
redundant, it seems wrong, since ? would be expected to follow
the previous character, and here the * is in the way.
Bob: You should consider the combination of the two characters as
one unit: *? is defined as a `non-greedy' version of the
* wild-card character.
Alice: Non-greedy?
Bob: Yes. Normally, the wild card notation is interpreted in a `greedy'
way: it gobbles up as much as it can.
Alice: I would call it a `hungry' way in that case. Can you give me a
simple example of the difference?
Bob: Sure. Let's use our friend irb again:
Now in the second regular expression I have added the question mark to make
the match non-greedy. In this case, the string "def:xyz" remains,
which means that the match only included "abc:". This was
the first match that satisfied the minimal requirement of having an
arbitrary number of characters ending with a colon. Our non-greedy operator
*? was satisfied at this point, while its greedy colleague *
kept looking for a longer match, and indeed found one.
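The same comparison can be made from the other side, by looking at what each pattern actually matched rather than at what is left over. This small sketch uses String#[] with a regular expression, which returns the matched substring:

```ruby
s = "abc:def:xyz"

# Greedy: runs on to the last colon it can reach.
greedy = s[/.*:/]

# Non-greedy: stops at the first colon.
lazy = s[/.*?:/]

puts greedy   # abc:def:
puts lazy     # abc:
```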
Alice: Very nice to have the option to stop early. And in our case, this
means that you are allowed to include colons in the definitions, without
confusing the parser, right?
Bob: Indeed. Every line among the single-line definitions has the
structure:
Alice: I like the idea of keeping maximum flexibility for option
specifications, rather than excluding characters like a colon. Good!
And I see in the next line that you raise an error if you find no
colon at all.
Bob: Yes. And if a colon is found, everything before the first colon
is assigned to the variable name and everything after that colon to
the variable content.
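A minimal sketch of that behavior, under stated assumptions: the helper name split_definition_line, the exact error message, the stripping of the content, and the "Default value" line are all hypothetical, and the pattern is modeled on the description rather than copied from the listing.

```ruby
# Hypothetical helper, not part of the Clop_option listing.
def split_definition_line(line)
  # Lazy match: stop at the first colon, so later colons stay in the content.
  m = /(\w.*?)\s*:/.match(line)
  if m.nil?
    raise ArgumentError, "option definition line has wrong format: #{line.inspect}"
  end
  # Stripping the content is an assumption made for this sketch.
  [m[1], m.post_match.strip]
end

name, content = split_definition_line("Value type : string")
star_name, star_content = split_definition_line("Default value: star : MS : ZAMS")
```

A line without any colon, such as "no colon here", raises the ArgumentError instead of being silently skipped.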
Alice: I understand how content gets its content, so to speak, since
$' is by definition what is left over after the match. And I also
understand that $& would not have been a good choice for name,
since it would have included the colon. I probably would have started
with $& and stripped off the last character.
Bob: That would still not be right, since in most cases you would have
wound up with a name that contained trailing spaces. You could have taken
those off too, of course, but I found a quicker way to do everything
at once. The key is given in the use of the parentheses in the first
line:
Alice: That one line is a rich line indeed! What does (\w.*?)
mean?
Bob: In general, parentheses in a regular expression can be used for two
purposes: they allow you to group characters together and they also allow
you to collect particular parts of the match results that you might be
interested in. An example of the first use is to write /(na)+/.
This specifies that the group of letters na is to be repeated one or more
times. In a word like banana, it matches against the
nana part. An example of the second use is what we see here in the code.
When parts of a regular expression are put within parentheses, the
variable $1 will be given the string that matches the content
of the first set of parentheses, the variable $2 will receive
a string containing the content of the second parentheses-delimited match,
and so on. Here there is only one set of parentheses, enclosing whatever
appears after initial white spaces, and before the first colon.
To be specific, a match against the (\w.*?) part requires there
to be at least one alphanumeric character or underscore, corresponding to
\w, followed by arbitrary characters. Since the \s*\: at the end of
the regular expression /\s*(\w.*?)\s*\:/ is placed outside the
parentheses, neither the colon nor any white space directly before it
appears in the value of the variable $1, but everything else up to the
colon does appear. Therefore, $1 will
contain the complete name, with any leading or trailing white space removed.
Actually, removing those leading and trailing white space characters was
not really necessary, as you will see below, since we're only matching the
`name' part of the definitions against various possibilities, and those
matches would work fine with blank space left in place. I just decided
to be extra neat, for a change.
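Both uses of parentheses can be tried out in a few lines. This is only a sketch: the grouping example uses + rather than *, because with * an empty match at the very start of the string would already succeed. The second capture group and the sample line are additions made for illustration.

```ruby
# Grouping: the parentheses make "na" the unit that + repeats.
"banana" =~ /(na)+/
repeated = $&        # the whole match
last_unit = $1       # the group as matched in its final repetition

# Capturing: two groups fill $1 and $2 (the second group is illustrative).
"  Short name  : -o" =~ /\s*(\w.*?)\s*:\s*(.*)/
name = $1
value = $2

# Without \s* before the colon, the lazy group keeps the blanks in front of it.
untrimmed = "  Short name  : -o"[/\s*(\w.*?):/, 1]

puts repeated          # nana
puts name              # Short name
puts untrimmed.inspect # "Short name  "
```

The last line shows why optional white space before the colon has to sit outside the parentheses if $1 is to come out without trailing blanks.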
5. The Second Journey: Clop_option
5.1. Code Listing
5.2. Parsing An Option Definition
parse_option_definitions(def_str)
def initialize(def_str)
parse_option_definition(def_str)
end
Short name: -o
or
Value type: string
Only at the end do we allow an arbitrarily long multi-line description
of what the option is all about. That was the line called Long
Description. It contains the information that will be echoed when
we ask for --help on the command line.
5.3. Are we Done Yet?
5.4. A Non-Greedy Wild Card
|gravity> irb
irb(main):001:0> s = "abc:def:xyz"
=> "abc:def:xyz"
irb(main):002:0> s =~ /.*:/ ; $'
=> "xyz"
irb(main):003:0> s =~ /.*?:/ ; $'
=> "def:xyz"
In the first regular expression, I ask for a match with an arbitrary number
of characters of any type, followed by a colon. The period can stand
for any character except a new line. As you can see, after the match, what
is left over is "xyz" so the match went all the way to the
second colon.
<name> : <content>
I do not allow a colon ":" to appear in the `name' part of
the line, but I do allow colons to appear in the `content' part. This is
yet another example of trying not to limit the user unnecessarily. An
example I thought about is where someone might want to define a classification
for stars, and for some reason decides that it is convenient to use colons.
Options to assign stars of different classes could take on the form:
--star_type "star : MS"
for a main sequence star, or
--star_type "star : MS : ZAMS"
as a further specialization, to indicate a zero-age-main-sequence
star. A giant on the asymptotic giant branch could be specified as:
--star_type "star: giant: AGB"
In all these cases, the non-greedy parser instruction will extract the content
part of the line correctly.
5.5. Extracting the Name from a Definition