11. Macros for text handling

m4 provides a number of builtins for manipulating text: extracting substrings, searching, substituting, and so on.


11.1 Calculating length of strings

The length of a string can be calculated by len:

Builtin: len (string)

Expands to the length of string, as a decimal number.

The macro len is recognized only with parameters.

 
len()
⇒0
len(`abcdef')
⇒6

11.2 Searching for substrings

Searching for substrings is done with index:

Builtin: index (string, substring)

Expands to the index of the first occurrence of substring in string. The first character in string has index 0. If substring does not occur in string, index expands to `-1'.

The macro index is recognized only with parameters.

 
index(`gnus, gnats, and armadillos', `nat')
⇒7
index(`gnus, gnats, and armadillos', `dag')
⇒-1

Omitting substring evokes a warning, but still produces output; contrast this with an empty substring.

 
index(`abc')
error-->m4:stdin:1: Warning: too few arguments to builtin `index'
⇒0
index(`abc', `')
⇒0
index(`abc', `b')
⇒1
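
For readers who know other languages, the conventions described above (zero-based positions, `-1' when there is no occurrence, and `0' for an empty substring) are the same ones used by Python's str.find. The snippet below is only an analogy written in Python; it says nothing about how m4 implements index.

```python
# Python analogy for the conventions of m4's index builtin.
text = "gnus, gnats, and armadillos"

first_hit = text.find("nat")   # zero-based position of the first occurrence
missing = text.find("dag")     # -1 when the substring does not occur
empty = "abc".find("")         # an empty substring is found at position 0

print(first_hit, missing, empty)
```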

11.3 Searching for regular expressions

Searching for regular expressions is done with the builtin regexp:

Builtin: regexp (string, regexp, [replacement])

Searches for regexp in string. The syntax for regular expressions is the same as in GNU Emacs, which is similar to BRE (Basic Regular Expressions) in POSIX. See Syntax of Regular Expressions in the GNU Emacs Manual. Support for ERE (Extended Regular Expressions) is not available, but will be added in GNU M4 2.0.

If replacement is omitted, regexp expands to the index of the first match of regexp in string. If regexp does not match anywhere in string, it expands to -1.

If replacement is supplied, and there was a match, regexp changes the expansion to this argument, with `\n' substituted by the text matched by the nth parenthesized sub-expression of regexp, up to nine sub-expressions. The escape `\&' is replaced by the text of the entire regular expression matched. For all other characters, `\' treats the next character literally. A warning is issued if there were fewer sub-expressions than the `\n' requested, or if there is a trailing `\'. If there was no match, regexp expands to the empty string.

The macro regexp is recognized only with parameters.

 
regexp(`GNUs not Unix', `\<[a-z]\w+')
⇒5
regexp(`GNUs not Unix', `\<Q\w*')
⇒-1
regexp(`GNUs not Unix', `\w\(\w+\)$', `*** \& *** \1 ***')
⇒*** Unix *** nix ***
regexp(`GNUs not Unix', `\<Q\w*', `*** \& *** \1 ***')
⇒

Here are some more examples on the handling of backslash:

 
regexp(`abc', `\(b\)', `\\\10\a')
⇒\b0a
regexp(`abc', `b', `\1\')
error-->m4:stdin:2: Warning: sub-expression 1 not present
error-->m4:stdin:2: Warning: trailing \ ignored in replacement
⇒
regexp(`abc', `\(\(d\)?\)\(c\)', `\1\2\3\4\5\6')
error-->m4:stdin:3: Warning: sub-expression 4 not present
error-->m4:stdin:3: Warning: sub-expression 5 not present
error-->m4:stdin:3: Warning: sub-expression 6 not present
⇒c
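
The backslash rules for replacement can be summarized in a small model. The following Python sketch is not m4's code: the function name expand_replacement and the groups list (whole match at position 0, then one entry per sub-expression, None for a sub-expression that did not take part in the match) are assumptions made for illustration. It covers only the rules stated above.

```python
def expand_replacement(template, groups):
    """Model of the replacement rules: returns (text, warnings).

    groups[0] is the whole match; groups[n] is sub-expression n,
    or None if that sub-expression did not participate in the match.
    """
    out, warnings = [], []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch != "\\":
            out.append(ch)          # ordinary character: copy as-is
            i += 1
            continue
        if i + 1 == len(template):  # backslash with nothing after it
            warnings.append("trailing \\ ignored in replacement")
            break
        nxt = template[i + 1]
        if nxt == "&":              # \& is the entire matched text
            out.append(groups[0])
        elif nxt in "123456789":    # \n is the nth sub-expression
            n = int(nxt)
            if n < len(groups):
                out.append(groups[n] or "")
            else:
                warnings.append("sub-expression %d not present" % n)
        else:
            # Any other character is taken literally.  \0 is not covered
            # by the rules above and is deliberately not modeled here.
            out.append(nxt)
        i += 2
    return "".join(out), warnings


# The three backslash examples above, with the match results filled in by hand.
print(expand_replacement(r"\\\10\a", ["b", "b"]))
print(expand_replacement("\\1\\", ["b"]))
print(expand_replacement(r"\1\2\3\4\5\6", ["c", "", None, "c"]))
```

Warnings are collected in a list rather than printed so that the text and the diagnostics can be inspected separately.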

Omitting regexp evokes a warning, but still produces output; contrast this with an empty regexp argument.

 
regexp(`abc')
error-->m4:stdin:1: Warning: too few arguments to builtin `regexp'
⇒0
regexp(`abc', `')
⇒0
regexp(`abc', `', `\\def')
⇒\def

11.4 Extracting substrings

Substrings are extracted with substr:

Builtin: substr (string, from, [length])

Expands to the substring of string, which starts at index from, and extends for length characters, or to the end of string, if length is omitted. The starting index of a string is always 0. The expansion is empty if there is an error parsing from or length, if from is beyond the end of string, or if length is negative.

The macro substr is recognized only with parameters.

 
substr(`gnus, gnats, and armadillos', `6')
⇒gnats, and armadillos
substr(`gnus, gnats, and armadillos', `6', `5')
⇒gnats

Omitting from evokes a warning, but still produces output.

 
substr(`abc')
error-->m4:stdin:1: Warning: too few arguments to builtin `substr'
⇒abc
substr(`abc',)
error-->m4:stdin:2: empty string treated as 0 in builtin `substr'
⇒abc
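
The rules for from and length can be modeled in a few lines. The Python function below is a sketch under stated assumptions, not m4's implementation: the arguments are taken to be already parsed as integers, and a negative from is assumed to yield an empty expansion, a case the description above does not spell out.

```python
def substr(string, start, length=None):
    """Model of the substr rules for already-parsed integer arguments."""
    if length is None:
        length = len(string)        # omitted length: run to the end of string
    # Assumption: a negative start is treated like the other error cases.
    if start < 0 or length < 0 or start >= len(string):
        return ""
    # Slicing stops at the end of the string if length overshoots.
    return string[start:start + length]


print(substr("gnus, gnats, and armadillos", 6))
print(substr("gnus, gnats, and armadillos", 6, 5))
```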

11.5 Translating characters

Character translation is done with translit:

Builtin: translit (string, chars, [replacement])

Expands to string, with each character that occurs in chars translated into the character from replacement with the same index.

If replacement is shorter than chars, the excess characters of chars are deleted from the expansion; if chars is shorter, the excess characters in replacement are silently ignored. If replacement is omitted, all characters in string that are present in chars are deleted from the expansion. If a character appears more than once in chars, only the first instance is used in making the translation. Only a single translation pass is made, even if characters in replacement also appear in chars.

As a GNU extension, both chars and replacement can contain character-ranges, e.g., `a-z' (meaning all lowercase letters) or `0-9' (meaning all digits). To include a dash `-' in chars or replacement, place it first or last in the entire string, or as the last character of a range. Back-to-back ranges can share a common endpoint. It is not an error for the last character in the range to be `larger' than the first. In that case, the range runs backwards, i.e., `9-0' means the string `9876543210'. The expansion of a range is dependent on the underlying encoding of characters, so using ranges is not always portable between machines.

The macro translit is recognized only with parameters.

 
translit(`GNUs not Unix', `A-Z')
⇒s not nix
translit(`GNUs not Unix', `a-z', `A-Z')
⇒GNUS NOT UNIX
translit(`GNUs not Unix', `A-Z', `z-a')
⇒tmfs not fnix
translit(`+,-12345', `+--1-5', `<;>a-c-a')
⇒<;>abcba
translit(`abcdef', `aabdef', `bcged')
⇒bgced

In the ASCII encoding, the first example deletes all uppercase letters, the second converts lowercase to uppercase, and the third `mirrors' all uppercase letters, while converting them to lowercase. The first two cases are by far the most common, even though they are not portable to EBCDIC or other encodings. The fourth example shows a range ending in `-', as well as back-to-back ranges. The final example shows that `a' is mapped to `b', not `c'; the resulting `b' is not further remapped to `g'; the `d' and `e' are swapped, and the `f' is discarded.
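
The range and mapping rules can be checked against a small model. The Python sketch below is an illustration only, not m4's source: the helper names expand_ranges and translit are chosen for this example, and character order is taken from Python's ord, which agrees with ASCII for the characters used here.

```python
def expand_ranges(spec):
    """Expand ranges such as a-z or 9-0 into the characters they denote."""
    out = []
    prev = None                     # last character emitted; a range may start here
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == "-" and prev is not None:
            if i + 1 == len(spec):  # trailing dash stands for itself
                out.append("-")
                break
            lo, hi = ord(prev), ord(spec[i + 1])
            step = 1 if lo <= hi else -1
            # prev was already emitted, so start one step past it.
            out.extend(chr(c) for c in range(lo + step, hi + step, step))
            prev = spec[i + 1]      # the endpoint can begin the next range
            i += 2
        else:
            out.append(ch)          # includes a leading dash
            prev = ch
            i += 1
    return "".join(out)


def translit(string, chars, replacement=""):
    """Single-pass translation; excess chars are deleted, first mapping wins."""
    chars = expand_ranges(chars)
    replacement = expand_ranges(replacement)
    mapping = {}
    for i, ch in enumerate(chars):
        if ch not in mapping:       # only the first instance in chars counts
            mapping[ch] = replacement[i] if i < len(replacement) else ""
    return "".join(mapping.get(ch, ch) for ch in string)


print(expand_ranges("+--1-5"), expand_ranges("<;>a-c-a"))
print(translit("abcdef", "aabdef", "bcged"))
```

Building the mapping once and applying it in one pass over string is what keeps a translated character from being translated again.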

Omitting chars evokes a warning, but still produces output.

 
translit(`abc')
error-->m4:stdin:1: Warning: too few arguments to builtin `translit'
⇒abc

11.6 Substituting text by regular expression

Global substitution in a string is done by patsubst:

Builtin: patsubst (string, regexp, [replacement])

Searches string for matches of regexp, and substitutes replacement for each match. The syntax for regular expressions is the same as in GNU Emacs (see section Searching for regular expressions).

The parts of string that are not covered by any match of regexp are copied to the expansion. Whenever a match is found, the search proceeds from the end of the match, so a character from string will never be substituted twice. If regexp matches a string of zero length, the start position for the search is incremented, to avoid infinite loops.

When a replacement is to be made, replacement is inserted into the expansion, with `\n' substituted by the text matched by the nth parenthesized sub-expression of regexp, for up to nine sub-expressions. The escape `\&' is replaced by the text of the entire regular expression matched. For all other characters, `\' treats the next character literally. A warning is issued if there were fewer sub-expressions than the `\n' requested, or if there is a trailing `\'.

The replacement argument can be omitted, in which case the text matched by regexp is deleted.

The macro patsubst is recognized only with parameters.

 
patsubst(`GNUs not Unix', `^', `OBS: ')
⇒OBS: GNUs not Unix
patsubst(`GNUs not Unix', `\<', `OBS: ')
⇒OBS: GNUs OBS: not OBS: Unix
patsubst(`GNUs not Unix', `\w*', `(\&)')
⇒(GNUs)() (not)() (Unix)()
patsubst(`GNUs not Unix', `\w+', `(\&)')
⇒(GNUs) (not) (Unix)
patsubst(`GNUs not Unix', `[A-Z][a-z]+')
⇒GN not 
patsubst(`GNUs not Unix', `not', `NOT\')
error-->m4:stdin:6: Warning: trailing \ ignored in replacement
⇒GNUs NOT Unix
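
The scanning rule (resume after each match, and step forward one character after an empty match) can be sketched as a loop. The Python model below is an assumption-laden illustration rather than m4's code: it uses Python's re syntax instead of Emacs syntax, the name patsubst_scan is invented for this example, and the replacement is passed as a function of the matched text so that the backslash escapes can be left out of the picture.

```python
import re


def patsubst_scan(string, pattern, replace):
    """Model of the patsubst scan; replace(matched_text) supplies each substitution."""
    regex = re.compile(pattern)
    out = []
    offset = 0
    while offset <= len(string):
        match = regex.search(string, offset)
        if match is None:
            out.append(string[offset:])          # no more matches: copy the rest
            break
        out.append(string[offset:match.start()]) # text skipped before the match
        out.append(replace(match.group(0)))
        offset = match.end()
        if match.start() == match.end():
            # Empty match: copy one character and move on, so the
            # loop always makes progress.
            out.append(string[offset:offset + 1])
            offset += 1
    return "".join(out)


def paren(text):
    return "(" + text + ")"


print(patsubst_scan("GNUs not Unix", r"\w*", paren))
print(patsubst_scan("GNUs not Unix", r"\w+", paren))
```

With `\w*', the empty match found right after each word is what produces the extra `()' in the third example above.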

Here is a slightly more realistic example, which capitalizes individual words or whole sentences, by substituting calls of the macros upcase and downcase into the strings.

Composite: upcase (text)
Composite: downcase (text)
Composite: capitalize (text)

Expand to text, but with capitalization changed: upcase changes all letters to upper case, downcase changes all letters to lower case, and capitalize changes the first character of each word to upper case and the remaining characters to lower case.

First, an example of their usage, using implementations distributed in `m4-1.4.13/examples/capitalize.m4'.

 
$ m4 -I examples
include(`capitalize.m4')
⇒
upcase(`GNUs not Unix')
⇒GNUS NOT UNIX
downcase(`GNUs not Unix')
⇒gnus not unix
capitalize(`GNUs not Unix')
⇒Gnus Not Unix

Now for the implementation. There is a helper macro _capitalize which puts only its first word in mixed case. Then capitalize merely parses out the words, and replaces them with an invocation of _capitalize. (As presented here, the capitalize macro has some subtle flaws. You should try to see if you can find and correct them; or see section Answers.)

 
$ m4 -I examples
undivert(`capitalize.m4')dnl
⇒divert(`-1')
⇒# upcase(text)
⇒# downcase(text)
⇒# capitalize(text)
⇒#   change case of text, simple version
⇒define(`upcase', `translit(`$*', `a-z', `A-Z')')
⇒define(`downcase', `translit(`$*', `A-Z', `a-z')')
⇒define(`_capitalize',
⇒       `regexp(`$1', `^\(\w\)\(\w*\)',
⇒               `upcase(`\1')`'downcase(`\2')')')
⇒define(`capitalize', `patsubst(`$1', `\w+', `_$0(`\&')')')
⇒divert`'dnl

While regexp replaces the whole input with the replacement as soon as there is a match, patsubst replaces each occurrence of a match and preserves non-matching pieces:

 
define(`patreg',
`patsubst($@)
regexp($@)')dnl
patreg(`bar foo baz Foo', `foo\|Foo', `FOO')
⇒bar FOO baz FOO
⇒FOO
patreg(`aba abb 121', `\(.\)\(.\)\1', `\2\1\2')
⇒bab abb 212
⇒bab
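
The same contrast exists in other regular expression libraries. As a loose analogy only, the Python sketch below pairs re.sub (replace every match, keep the rest) with re.search followed by Match.expand (produce only the replacement for the first match). It uses Python's regular expression syntax, where groups are written without backslashes, and the helper name patreg is borrowed from the example above for convenience.

```python
import re


def patreg(string, pattern, replacement):
    """Return (patsubst-like result, regexp-like result) for comparison."""
    every = re.sub(pattern, replacement, string)    # all matches replaced in place
    first = re.search(pattern, string)
    # Only the expanded replacement survives; empty if nothing matched.
    only = first.expand(replacement) if first else ""
    return every, only


print(patreg("bar foo baz Foo", "foo|Foo", "FOO"))
print(patreg("aba abb 121", r"(.)(.)\1", r"\2\1\2"))
```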

Omitting regexp evokes a warning, but still produces output; contrast this with an empty regexp argument.

 
patsubst(`abc')
error-->m4:stdin:1: Warning: too few arguments to builtin `patsubst'
⇒abc
patsubst(`abc', `')
⇒abc
patsubst(`abc', `', `\\-')
⇒\-a\-b\-c\-

11.7 Formatting strings (printf-like)

Formatted output can be made with format:

Builtin: format (format-string, …)

Works much like the C function printf. The first argument format-string can contain `%' specifications which are satisfied by additional arguments, and the expansion of format is the formatted string.

The macro format is recognized only with parameters.

Its use is best described by a few examples:

 
define(`foo', `The brown fox jumped over the lazy dog')
⇒
format(`The string "%s" uses %d characters', foo, len(foo))
⇒The string "The brown fox jumped over the lazy dog" uses 38 characters
format(`%*.*d', `-1', `-1', `1')
⇒1
format(`%.0f', `56789.9876')
⇒56790
len(format(`%-*X', `5000', `1'))
⇒5000
ifelse(format(`%010F', `infinity'), `       INF', `success',
       format(`%010F', `infinity'), `  INFINITY', `success',
       format(`%010F', `infinity'))
⇒success
ifelse(format(`%.1A', `1.999'), `0X1.0P+1', `success',
       format(`%.1A', `1.999'), `0X2.0P+0', `success',
       format(`%.1A', `1.999'))
⇒success
format(`%g', `0xa.P+1')
⇒20

Using the forloop macro defined earlier (see section Iteration by counting), this example shows how format can be used to produce tabular output.

 
$ m4 -I examples
include(`forloop.m4')
⇒
forloop(`i', `1', `10', `format(`%6d squared is %10d
', i, eval(i**2))')
⇒     1 squared is          1
⇒     2 squared is          4
⇒     3 squared is          9
⇒     4 squared is         16
⇒     5 squared is         25
⇒     6 squared is         36
⇒     7 squared is         49
⇒     8 squared is         64
⇒     9 squared is         81
⇒    10 squared is        100
⇒
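
Because format follows printf conventions, the same specifications behave alike in other printf-style facilities. The Python sketch below uses the % operator as a stand-in; it is an analogy, not a description of m4 internals, and Python's formatting differs from format in details not shown here.

```python
# Build the same table with Python's printf-style % operator.
rows = ["%6d squared is %10d" % (i, i ** 2) for i in range(1, 11)]
print("\n".join(rows))

# A precision of 0 rounds to the nearest integer.
rounded = "%.0f" % 56789.9876

# A '*' width is taken from the argument list; '-' left-justifies.
padded = "%-*X" % (5000, 1)
print(rounded, len(padded))
```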

The builtin format is modeled after the ANSI C `printf' function, and supports these `%' specifiers: `c', `s', `d', `o', `x', `X', `u', `a', `A', `e', `E', `f', `F', `g', `G', and `%'; it supports field widths and precisions, and the flags `+', `-', ` ', `0', `#', and `''. For integer specifiers, the length modifiers `hh', `h', and `l' are recognized, and for floating point specifiers, the length modifier `l' is recognized. Items not yet supported include positional arguments, the `n', `p', `S', and `C' specifiers, the `z', `t', `j', `L' and `ll' modifiers, and any platform extensions available in the native printf. For more details on the functioning of printf, see the C Library Manual, or the POSIX specification. Note that `%a' is supported even on platforms that haven't yet implemented C99 hexadecimal floating point output natively.

Unrecognized specifiers result in a warning. It is anticipated that a future release of GNU m4 will support more specifiers, and give better warnings when various problems such as overflow are encountered. Likewise, escape sequences are not yet recognized.

 
format(`%p', `0')
error-->m4:stdin:1: Warning: unrecognized specifier in `%p'
⇒

This document was generated on July 20, 2009 using texi2html 1.76.