=================================
Using SCALE's DSL Parsing Library
=================================

>>> from scale import dsl


Tokenizing Text
===============

Most of the DSL parsing API operates on sequences or iterators of tokens, as
generated by the standard library ``tokenize`` module. You can use that module
directly, or you can use these convenience functions to do the tokenizing:

tokenize_string(text)
    Yield the tokens of `text`

tokenize_stream(file)
    Yield the tokens found in the open iterable stream `file`

tokenize_file(filename)
    Open `filename` for text reading, and yield its tokens

All of these functions support source encoding comments and BOM markers as
prescribed by `PEP 263 <http://www.python.org/peps/pep-0263.html>`_. However,
if you supply a unicode string to ``tokenize_string()`` or a unicode stream to
``tokenize_stream()``, any PEP 263 source encoding information will be ignored,
as it is assumed that you have already done any necessary decoding.

Example usage::

>>> from tokenize import tok_name
>>> [(tok_name[t],v) for t,v,s,e,line in dsl.tokenize_string("1+2")]
[('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('ENDMARKER', '')]
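
``tokenize_stream()`` works the same way on an open file-like object. Here is
a minimal sketch, assuming (per the description above) that any iterable
stream of lines is acceptable::

>>> from StringIO import StringIO
>>> [(tok_name[t],v) for t,v,s,e,line in
...     dsl.tokenize_stream(StringIO("1+2"))]
[('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('ENDMARKER', '')]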

Converting Tokens Back to Text
==============================

The ``detokenize()`` function converts an iterable of tokens back into a
string::

>>> print dsl.detokenize(dsl.tokenize_string("1+2"))
1+2
>>> print dsl.detokenize(dsl.tokenize_string("1+ 2 #foo"))
1+ 2 #foo

The resulting string will have every token on its original line in
the input::

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+ \\
... \\
... 2"""))
1+ \
\
2

But the tokens will be shifted to the left such that the first non-whitespace,
non-comment token is in the first column of the output::

>>> print dsl.detokenize(dsl.tokenize_string("""
... print '''foo
... bar''' + '''spam
... baz''';"""))
<BLANKLINE>
print '''foo
bar''' + '''spam
baz''';

unless you use the optional ``indent`` parameter to change the default
indentation::

>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=2)
'  print foo'

>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=4)
'    print foo'

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+ \\
... \\
... 2"""), indent=4)
    1+ \
    \
    2

But note that re-indentation doesn't affect the contents of multi-line
strings, such as docstrings. That is, after reindenting, the string has
the same value it did before, even if it makes the contents of the string look
odd::

>>> print dsl.detokenize(dsl.tokenize_string("""\
... # a comment that's oddly indented - or is it?
... def x():
...     '''more than one
... line in the docstring'''
... """), indent=12)
            # a comment that's oddly indented - or is it?
            def x():
                '''more than one
line in the docstring'''
<BLANKLINE>

Notice also that any comments occurring before the first non-whitespace token
in the token stream are formatted flush left to the indent column, regardless
of their position in the input. (This is because ``detokenize()`` doesn't know
how far to offset the input lines from their starting positions until it
encounters a non-comment token.)

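For instance, here is a sketch (hypothetical input, following from the rule
just stated) in which a comment indented in the input still comes out flush
at the indent column::

>>> print dsl.detokenize(dsl.tokenize_string("""\
...     # this comment starts in column 4
... x = 1
... """), indent=8)
        # this comment starts in column 4
        x = 1
<BLANKLINE>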

You can strip whitespace tokens like indents, dedents, comments, and newlines
from a token list with ``strip_ws()``, which yields all the non-whitespace
tokens in a sequence::

>>> dsl.detokenize(dsl.strip_ws(dsl.tokenize_string("123 #xyz")))
'123'

``strip_ws()`` is intended to make parsing individual statements easier. But
you should not use it on token streams that span more than one logical line,
because the ``NEWLINE`` whitespace token separates logical lines, and the
``INDENT`` and ``DEDENT`` tokens are used to identify blocks. With these
tokens removed, parsing blocks into statements and nested blocks becomes
impossible. Therefore, if you are doing `block parsing`_, you should only
strip whitespace from individual statements, not from the input to
``parse_block()``.

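To see why, look at the token types for a simple two-line snippet (a quick
sketch; ``tok_name`` was imported from the standard ``tokenize`` module
earlier)::

>>> [tok_name[t] for t,v,s,e,line in
...     dsl.tokenize_string("if x:\n    y()\n")]
['NAME', 'NAME', 'OP', 'NEWLINE', 'INDENT', 'NAME', 'OP', 'OP', 'NEWLINE', 'DEDENT', 'ENDMARKER']

``strip_ws()`` would drop the ``NEWLINE``, ``INDENT``, and ``DEDENT`` tokens
here, leaving no way to tell that ``y()`` was nested under the ``if``.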

Block Parsing
=============

The ``parse_block(tokens)`` function turns an iterable of tokens into a
**block**, which is a list of statements and the blocks that appear indented
under those statements. More specifically, it is a list of two-item
"(`statement`,`block`)" tuples, where `statement` is a list of the tokens
representing a single statement, and `block` is a (possibly-empty) nested list
of "(`statement`,`block`)" pairs::

>>> dsl.parse_block(dsl.tokenize_string("1+2"))
[([(...'1'...), (...'+'...), (...'2'...)], [])]

Blocks can be flattened back into a token sequence using ``flatten_block()``,
so you can then detokenize the result back into a string::

>>> print dsl.detokenize(
... dsl.flatten_block(dsl.parse_block(dsl.tokenize_string("1+2")))
... )
1+2

Thus, you can parse a file into a block, then traverse the statement tree and
turn sub-blocks back into strings at whatever indentation level you like. This
is especially useful for creating parser generators or other tools that
implement a high-level language embedding blocks of Python code, which must
then be incorporated into the tool's output.

Here's a more detailed example. First, let's parse a block::

>>> block = dsl.parse_block(dsl.tokenize_string("""\
... def foo():
...     pass
... def bar(baz,spam):
...     whee()
... """))

The block has two statements in it::

>>> len(block)
2

We'll print them, stripping whitespace so that they don't end with line feeds::

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.strip_ws(stmt))
def foo():
def bar(baz,spam):

Now let's print the bodies of the statements, indenting them to match their
original positions::

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.flatten_block(blk), indent=4)
    pass
<BLANKLINE>
    whee()
<BLANKLINE>

Or, to print the whole block, we can simply flatten and detokenize it::

>>> print dsl.detokenize(dsl.flatten_block(block))
def foo():
    pass
def bar(baz,spam):
    whee()
<BLANKLINE>

Now, let's create a simple code reformatter that realigns a block and its
children to a uniform indentation width::

>>> def reindent(block, indent_by=4, start=0):
...     out = []
...     for stmt,blk in block:
...         # realign this statement to the current indent column
...         out.append(dsl.detokenize(stmt, indent=start))
...         if blk:
...             # recurse, indenting the nested block one level deeper
...             out.append(reindent(blk, indent_by, start+indent_by))
...     return ''.join(out)

>>> print reindent(block)
def foo():
    pass
def bar(baz,spam):
    whee()
<BLANKLINE>

>>> print reindent(block, 1)
def foo():
 pass
def bar(baz,spam):
 whee()
<BLANKLINE>

>>> print reindent(block, 7, 3)
   def foo():
          pass
   def bar(baz,spam):
          whee()
<BLANKLINE>
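
Finally, since ``tokenize_file()`` accepts a filename, the same few lines give
you a whole-file reformatter. A sketch (``example.py`` is a hypothetical
filename, not a file shipped with SCALE)::

block = dsl.parse_block(dsl.tokenize_file("example.py"))
print reindent(block)   # the whole file, realigned to 4-space indents

And as noted earlier, the contents of multi-line strings are never shifted,
so the realigned code keeps exactly the same string values as the original.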