Re: [sv-bc] hex number in string literal

From: Greg Jaxon <Greg.Jaxon_at_.....>
Date: Fri Mar 24 2006 - 11:14:06 PST
Krishanu Debnath wrote:
>>>  >> Now consider this example: (note the embedded comments)
>>>  >>
>>>  >> module sample;
>>>  >>
>>>  >>     string s;
>>>  >>
>>>  >>     initial
>>>  >>     begin
>>>  >>         s = "\x41";  // this means now s is "A". ASCII value of A is 0x41.
>>>  >>         $display("value of s %s \n", s);
>>>  >>
>>>  >>         s = "\x4142"; // does this mean s is "A42" ?
>>>  >>         $display("value of s %s \n", s);
>>>  >>
>>>  >>         s = "\x41\x42"; // does this mean s is "AB" ?
>>>  >>         $display("value of s %s \n", s);
>>>  >>
>>>  >>         s = "\x4"; // fewer than two hex characters follow the x,
>>>  >>                    // so it will not be treated as a hex number.
>>>  >>         $display("value of s %s \n", s);
>>>  >>     end
>>>  >> endmodule
>>>  >>
>>>  >> Does the above make sense?

The question makes a lot of sense.  The LRM is very far from definitive
on this subject.  To be fair, the C standard, where this syntax got started,
is a bit ambiguous too.  Its BNF says the octal escapes are 1-3 digits
and its hexadecimal escapes go for as long as they can.  But then C adds
two confusing extra constraints.  First, each octal or hexadecimal escape
sequence is the longest sequence of characters that can constitute the
escape sequence.  [Subject, we assume, to the BNF 1-3 digit definition,
so \177 is not \17 followed by ASCII "7"?]  Second, it requires the value
of the octal or hexadecimal escape sequence to be in the range representable
by unsigned char or wchar_t data.  (By the longest-match rule, "\x4142" is
a single escape whose value, 0x4142, does not fit in an unsigned char.)

I think we should consider that the natural intent is for each escape
sequence to produce exactly one character element of the string.  To that
end, there should be lexical cues that indicate whether the characters
are to be 8 or 16 bits wide.  Those cues have to inform the lexical scan
which upper bound to use on the escape sequence length.
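
To make the one-escape, one-element reading concrete, here is a small
sketch (the module name is made up, and this shows how I would expect a
conforming tool to behave under that reading, not what any particular
simulator does today):

module one_escape_one_element;

    string s;

    initial
    begin
        s = "\x41\x42";
        // Two escapes, so exactly two character elements, each one byte wide.
        $display("len = %0d  getc(0) = %s  getc(1) = %s",
                 s.len(), s.getc(0), s.getc(1));
        // Expected: len = 2  getc(0) = A  getc(1) = B
    end
endmodule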

I don't think SV has wchar_t strings (yet).  But surely it is inevitable.

Octal notation for 16-bit characters is awkward: should the two pad bits
both go into the first digit, or one each in the first and fourth digits?
(\177777 vs \377377)  I say neither... octal can only represent 8 bits of
a char, or 9 of a wchar.
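
A quick numeric check of those two spellings, using nothing more exotic
than $display's octal formatting (module name invented for the example):

module octal_pad_check;

    initial
    begin
        // 16'hFFFF as a single 16-bit octal number: both pad bits land in
        // the leading digit, which can then only be 0 or 1.
        $display("%o", 16'hFFFF);        // prints 177777
        // The same value split byte-wise: each byte carries its own pad
        // bit, so each 3-digit group tops out at 377.
        $display("%o%o", 8'hFF, 8'hFF);  // prints 377377
    end
endmodule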

Hex notation can represent 4 or 8 bits of a char, and 4, 8, 12, or 16 bits
of a wchar.  These right-align in the char (or wchar).

When lexing a  char string, an escape may contain 1-2 hex digits.
When lexing a wchar string, an escape may contain 1-4 hex digits.
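
Applying the 1-2 digit bound to today's 8-bit strings, the examples at the
top of this thread would come out as follows (a sketch of the proposed rule,
not of what any shipping tool does):

module hex_escape_bound;

    string s;

    initial
    begin
        s = "\x41";      // one escape, one element:            "A"
        s = "\x4142";    // escape stops after two hex digits:  "A" '4' '2' -> "A42"
        s = "\x41\x42";  // two escapes, two elements:          "AB"
        s = "\x4";       // a single digit is still an escape:  one element, code 0x04
    end
endmodule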

Allowing one escape to create several character elements makes the literals
harder to migrate upward from char to wchar.  Requiring one escape per char
means you can replace \x by \x00 and get a sensible result.  This also
spares us from endianness problems trying to convert long escape sequences
into byte streams.
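
A sketch of why that widening stays sensible, written with plain integer
types since SV has no wide string type yet (the module name and the
"\x0041" spelling are hypothetical):

module widen_check;

    byte     c = "\x41";    // 8-bit element from today's string literal
    shortint w = 16'h0041;  // what a hypothetical wide-string "\x0041" would hold

    initial
    begin
        // Right-aligned widening keeps the character code unchanged, and no
        // byte-order question arises because one escape fills one element.
        if (w == c)
            $display("\\x41 and \\x0041 name the same code: 'h%0h", w);
    end
endmodule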

If anyone would like an exhaustive review of all the different flavors
of Hollerith encoding, perhaps we could find something SV has not yet claimed
to implement...;-)

Greg Jaxon
Received on Fri Mar 24 11:14:28 2006
