=================================
Using SCALE's DSL Parsing Library
=================================

    >>> from scale import dsl


Tokenizing Text
===============

Most of the DSL parsing API operates on sequences or iterators of tokens, as
generated by the standard library ``tokenize`` module.  You can use that module
directly, or you can use these convenience functions to do the tokenizing:

tokenize_string(text)
    Yield the tokens of `text`

tokenize_stream(file)
    Yield the tokens found in the open iterable stream `file`

tokenize_file(filename)
    Open `filename` for text reading, and yield its tokens

All of these functions support source encoding comments and BOM markers as
prescribed by `PEP 263 <http://www.python.org/peps/pep-0263.html>`_.  However,
if you supply a unicode string to ``tokenize_string()`` or a unicode stream to
``tokenize_stream()``, any PEP 263 source encoding information will be ignored,
as it is assumed you have already done any necessary decoding.

Example usage::

    >>> from tokenize import tok_name

    >>> [(tok_name[t],v) for t,v,s,e,line in dsl.tokenize_string("1+2")]
    [('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('ENDMARKER', '')]
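
The other two helpers work the same way.  Here is a rough sketch (not part of
the library's own doctests), assuming that ``tokenize_stream()`` accepts any
file-like object such as a ``StringIO``, and that ``tokenize_file()`` opens
and reads the named file for you::

    from StringIO import StringIO

    # Tokenizing an in-memory stream; this should yield the same tokens as
    # tokenize_string("1+2") above.
    for t, v, start, end, line in dsl.tokenize_stream(StringIO("1+2")):
        print tok_name[t], repr(v)

    # tokenize_file() does the open() itself -- "example.py" is just a
    # hypothetical file name:
    #
    # for t, v, start, end, line in dsl.tokenize_file("example.py"):
    #     ...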


Converting Tokens Back to Text
==============================

The ``detokenize()`` function converts an iterable of tokens back into a
string::

    >>> print dsl.detokenize(dsl.tokenize_string("1+2"))
    1+2
    >>> print dsl.detokenize(dsl.tokenize_string("1+ 2   #foo"))
    1+ 2   #foo

The resulting string will have every token on its original line in
the input::

    >>> print dsl.detokenize(dsl.tokenize_string("""\
    ... 1+  \\
    ...    \\
    ... 2"""))
    1+  \
    \
    2

But the tokens will be shifted to the left such that the first non-whitespace,
non-comment token is in the first column of the output::

    >>> print dsl.detokenize(dsl.tokenize_string("""
    ...     print '''foo
    ...     bar''' + '''spam
    ...     baz''';"""))
    <BLANKLINE>
    print '''foo
        bar''' + '''spam
        baz''';

unless you use the optional ``indent`` parameter to change the default
indentation::

    >>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=2)
    '  print foo'
    >>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=4)
    '    print foo'

    >>> print dsl.detokenize(dsl.tokenize_string("""\
    ... 1+  \\
    ...    \\
    ... 2"""), indent=4)
        1+  \
        \
        2


But note that re-indentation doesn't affect the contents of multi-line
strings, such as docstrings.  That is, after reindenting, the string has
the same value it did before, even if it makes the contents of the string look
odd::

    >>> print dsl.detokenize(dsl.tokenize_string("""\
    ...         # a comment that's oddly indented - or is it?
    ...     def x():
    ...         '''more than one
    ...            line in the docstring'''
    ... """), indent=12)
                # a comment that's oddly indented - or is it?
                def x():
                    '''more than one
               line in the docstring'''
    <BLANKLINE>

Notice also that any comments occurring before the first non-whitespace token
in the token stream are formatted flush left to the indent column, regardless
of their position in the input.  (This is because ``detokenize()`` doesn't know
how far to offset the input lines from their starting positions until it
encounters a non-comment token.)

You can strip whitespace tokens like indents, dedents, comments, and newlines
from a token list with ``strip_ws()``, which yields all the non-whitespace
tokens in a sequence::

    >>> dsl.detokenize(dsl.strip_ws(dsl.tokenize_string("123 #xyz")))
    '123'

``strip_ws()`` is intended to make parsing individual statements easier.  But
you should not use it on token streams that span more than one logical line,
because the ``NEWLINE`` whitespace token separates logical lines,
and the ``INDENT`` and ``DEDENT`` tokens are used to identify blocks.  With
these tokens removed, parsing blocks into statements and nested blocks becomes
impossible.  Therefore, if you are doing `block parsing`_ you should only strip
whitespace from individual statements, not from the input to ``parse_block()``.
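
For example, a safe way to combine the two is to parse first and strip
afterwards.  Here is a small sketch of that pattern (it relies on the
``parse_block()`` API described in the next section)::

    # Parse the whole stream with its whitespace tokens intact, then strip
    # whitespace only from each individual statement:
    tokens = dsl.tokenize_string("x = 1  # set x\ny = 2\n")
    for stmt, blk in dsl.parse_block(tokens):
        print dsl.detokenize(dsl.strip_ws(stmt))    # "x = 1", then "y = 2"

    # Stripping before parsing -- e.g. parse_block(strip_ws(tokens)) --
    # would discard the NEWLINE, INDENT, and DEDENT tokens that
    # parse_block() relies on to find statement and block boundaries.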


Block Parsing
=============

The ``parse_block(tokens)`` function turns an iterable of tokens into a
**block**, which is a list of statements and the blocks that appear indented
under those statements.  More specifically, it is a list of two-item
"(`statement`,`block`)" tuples, where `statement` is a list of the tokens
representing a single statement, and `block` is a (possibly-empty) nested list
of "(`statement`,`block`)" pairs::

    >>> dsl.parse_block(dsl.tokenize_string("1+2"))
    [([(...'1'...), (...'+'...), (...'2'...)], [])]
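
A statement that introduces an indented suite ends up with a non-empty nested
block.  The following sketch (again, not one of the library's own doctests)
illustrates the shape of the result::

    block = dsl.parse_block(dsl.tokenize_string("if x:\n    y\n"))

    # One top-level (statement, block) pair for "if x:"...
    stmt, body = block[0]
    print len(block)                            # should be 1

    # ...whose nested block holds the single statement "y":
    print len(body)                             # should be 1
    print dsl.detokenize(dsl.strip_ws(stmt))    # should print "if x:"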


Blocks can be flattened back into a token sequence using ``flatten_block()``,
so you can then detokenize the result back into a string::

    >>> print dsl.detokenize(
    ...     dsl.flatten_block(dsl.parse_block(dsl.tokenize_string("1+2")))
    ... )
    1+2

Thus, you can parse a file into a block, then traverse the statement tree and
turn sub-blocks back into strings at whatever indentation level you like.  This
is especially useful for creating parser generators and other tools that
implement a high-level language with embedded blocks of Python code that must
be incorporated into their output.
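
As a sketch of that use case, imagine a hypothetical rule-based mini-language
that puts Python code in the block under each ``rule NAME:`` statement (the
``rule`` keyword and the ``do_something()`` calls below are made up for the
example).  The embedded bodies can be recovered as plain source text at any
indentation::

    source = (
        "rule first:\n"
        "    do_something()\n"
        "rule second:\n"
        "    do_something_else()\n"
    )
    for stmt, body in dsl.parse_block(dsl.tokenize_string(source)):
        header = dsl.detokenize(dsl.strip_ws(stmt))   # e.g. "rule first:"
        code = dsl.detokenize(dsl.flatten_block(body), indent=8)
        print header
        print code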

Here's a more detailed example.  First, let's parse a block::

    >>> block = dsl.parse_block(dsl.tokenize_string("""\
    ... def foo():
    ...     pass
    ... def bar(baz,spam):
    ...     whee()
    ... """))

The block has two statements in it::

    >>> len(block)
    2

We'll print them, stripping whitespace so that they don't end with line feeds::

    >>> for stmt,blk in block:
    ...     print dsl.detokenize(dsl.strip_ws(stmt))
    def foo():
    def bar(baz,spam):

Now let's print the bodies of the statements, indenting them to match their
original positions::

    >>> for stmt,blk in block:
    ...     print dsl.detokenize(dsl.flatten_block(blk), indent=4)
        pass
    <BLANKLINE>
        whee()
    <BLANKLINE>

Or, to print the whole block, we can simply flatten and detokenize it::

    >>> print dsl.detokenize(dsl.flatten_block(block))
    def foo():
        pass
    def bar(baz,spam):
        whee()
    <BLANKLINE>

Now let's create a simple code reformatter that realigns a block and its
children to a uniform indentation width::

    >>> def reindent(block, indent_by=4, start=0):
    ...     out = []
    ...     for stmt,blk in block:
    ...         out.append(dsl.detokenize(stmt, indent=start))
    ...         if blk:
    ...             out.append(reindent(blk, indent_by, start+indent_by))
    ...     return ''.join(out)

    >>> print reindent(block)
    def foo():
        pass
    def bar(baz,spam):
        whee()
    <BLANKLINE>

    >>> print reindent(block, 1)
    def foo():
     pass
    def bar(baz,spam):
     whee()
    <BLANKLINE>

    >>> print reindent(block, 7, 3)
       def foo():
              pass
       def bar(baz,spam):
              whee()
    <BLANKLINE>
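
Combining the pieces, a whole file can be reformatted in just a few lines.
The sketch below uses the hypothetical file name ``example.py`` together with
the ``reindent()`` function defined above::

    def reformat_file(filename, indent_by=4):
        # Tokenize the file, parse it into a statement tree, and re-emit it
        # with a uniform indent of `indent_by` spaces per block level.
        block = dsl.parse_block(dsl.tokenize_file(filename))
        return reindent(block, indent_by)

    # print reformat_file("example.py")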


