=================================
Using SCALE's DSL Parsing Library
=================================

>>> from scale import dsl


Tokenizing Text
===============

Most of the DSL parsing API operates on sequences or iterators of tokens, as
generated by the standard library ``tokenize`` module. You can use that module
directly, or you can use these convenience functions to do the tokenizing:

tokenize_string(text)
    Yield the tokens of `text`

tokenize_stream(file)
    Yield the tokens found in the open iterable stream `file`

tokenize_file(filename)
    Open `filename` for text reading, and yield its tokens

All of these functions support source encoding comments and BOM markers as
prescribed by `PEP 263 <http://www.python.org/peps/pep-0263.html>`_. However,
if you supply a unicode string to ``tokenize_string()`` or a unicode stream to
``tokenize_stream()``, any PEP 263 source encoding information will be ignored,
as it is assumed that you have already done any necessary decoding.

Example usage::

>>> from tokenize import tok_name
>>> [(tok_name[t],v) for t,v,s,e,line in dsl.tokenize_string("1+2")]
[('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('ENDMARKER', '')]
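
``tokenize_stream()`` works the same way on an open file-like object. Here is
a minimal sketch, assuming (per the description above) that any iterable
stream of lines is acceptable::

>>> from StringIO import StringIO
>>> [(tok_name[t],v) for t,v,s,e,line in
...     dsl.tokenize_stream(StringIO("1+2"))]
[('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('ENDMARKER', '')]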

Converting Tokens Back to Text
==============================

The ``detokenize()`` function converts an iterable of tokens back into a
string::

>>> print dsl.detokenize(dsl.tokenize_string("1+2"))
1+2
>>> print dsl.detokenize(dsl.tokenize_string("1+ 2 #foo"))
1+ 2 #foo

The resulting string will have every token on its original line in
the input::

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+ \\
... \\
... 2"""))
1+ \
\
2

But the tokens will be shifted to the left such that the first non-whitespace,
non-comment token is in the first column of the output::

>>> print dsl.detokenize(dsl.tokenize_string("""
... print '''foo
... bar''' + '''spam
... baz''';"""))
<BLANKLINE>
print '''foo
bar''' + '''spam
baz''';

unless you use the optional ``indent`` parameter to change the default
indentation::

>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=2)
'  print foo'

>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=4)
'    print foo'

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+ \\
... \\
... 2"""), indent=4)
    1+ \
    \
    2

But note that re-indentation doesn't affect the contents of multi-line
strings, such as docstrings. That is, after reindenting, the string has
the same value it did before, even if it makes the contents of the string look
odd::

>>> print dsl.detokenize(dsl.tokenize_string("""\
... # a comment that's oddly indented - or is it?
... def x():
...     '''more than one
... line in the docstring'''
... """), indent=12)
            # a comment that's oddly indented - or is it?
            def x():
                '''more than one
line in the docstring'''
<BLANKLINE>

Notice also that any comments occurring before the first non-whitespace token
in the token stream are formatted flush left to the indent column, regardless
of their position in the input. (This is because ``detokenize()`` doesn't know
how far to offset the input lines from their starting positions until it
encounters a non-comment token.)

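For instance, here is a sketch (hypothetical input, following from the rule
just stated) in which a comment indented in the input still comes out flush
at the indent column::

>>> print dsl.detokenize(dsl.tokenize_string("""\
...     # this comment starts in column 4
... x = 1
... """), indent=8)
        # this comment starts in column 4
        x = 1
<BLANKLINE>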

You can strip whitespace tokens like indents, dedents, comments, and newlines
from a token list with ``strip_ws()``, which yields all the non-whitespace
tokens in a sequence::

>>> dsl.detokenize(dsl.strip_ws(dsl.tokenize_string("123 #xyz")))
'123'

``strip_ws()`` is intended to make parsing individual statements easier. But
you should not use it on token streams that span more than one logical line,
because the ``NEWLINE`` whitespace token separates logical lines, and the
``INDENT`` and ``DEDENT`` tokens are used to identify blocks. With these
tokens removed, parsing blocks into statements and nested blocks becomes
impossible. Therefore, if you are doing `block parsing`_, you should only
strip whitespace from individual statements, not from the input to
``parse_block()``.

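To see why, look at the token types for a simple two-line snippet (a quick
sketch; ``tok_name`` was imported from the standard ``tokenize`` module
earlier)::

>>> [tok_name[t] for t,v,s,e,line in
...     dsl.tokenize_string("if x:\n    y()\n")]
['NAME', 'NAME', 'OP', 'NEWLINE', 'INDENT', 'NAME', 'OP', 'OP', 'NEWLINE', 'DEDENT', 'ENDMARKER']

``strip_ws()`` would drop the ``NEWLINE``, ``INDENT``, and ``DEDENT`` tokens
here, leaving no way to tell that ``y()`` was nested under the ``if``.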

Block Parsing
=============

The ``parse_block(tokens)`` function turns an iterable of tokens into a
**block**, which is a list of statements and the blocks that appear indented
under those statements. More specifically, it is a list of two-item
"(`statement`,`block`)" tuples, where `statement` is a list of the tokens
representing a single statement, and `block` is a (possibly-empty) nested list
of "(`statement`,`block`)" pairs::

>>> dsl.parse_block(dsl.tokenize_string("1+2"))
[([(...'1'...), (...'+'...), (...'2'...)], [])]

Blocks can be flattened back into a token sequence using ``flatten_block()``,
so you can then detokenize the result back into a string::

>>> print dsl.detokenize(
... dsl.flatten_block(dsl.parse_block(dsl.tokenize_string("1+2")))
... )
1+2

Thus, you can parse a file into a block, then traverse the statement tree and
turn sub-blocks back into strings at whatever indentation level you like. This
is especially useful for creating parser generators or other tools that
implement a high-level language embedding blocks of Python code, which must
then be incorporated into the tool's output.

Here's a more detailed example. First, let's parse a block::

>>> block = dsl.parse_block(dsl.tokenize_string("""\
... def foo():
...     pass
... def bar(baz,spam):
...     whee()
... """))

The block has two statements in it::

>>> len(block)
2

We'll print them, stripping whitespace so that they don't end with line feeds::

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.strip_ws(stmt))
def foo():
def bar(baz,spam):

Now let's print the bodies of the statements, indenting them to match their
original positions::

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.flatten_block(blk), indent=4)
    pass
<BLANKLINE>
    whee()
<BLANKLINE>

Or, to print the whole block, we can simply flatten and detokenize it::

>>> print dsl.detokenize(dsl.flatten_block(block))
def foo():
    pass
def bar(baz,spam):
    whee()
<BLANKLINE>

Now, let's create a simple code reformatter that realigns a block and its
children to a uniform indentation width::

>>> def reindent(block, indent_by=4, start=0):
...     out = []
...     for stmt,blk in block:
...         # realign this statement to the current indent column
...         out.append(dsl.detokenize(stmt, indent=start))
...         if blk:
...             # recurse, indenting the nested block one level deeper
...             out.append(reindent(blk, indent_by, start+indent_by))
...     return ''.join(out)

>>> print reindent(block)
def foo():
    pass
def bar(baz,spam):
    whee()
<BLANKLINE>

>>> print reindent(block, 1)
def foo():
 pass
def bar(baz,spam):
 whee()
<BLANKLINE>

>>> print reindent(block, 7, 3)
   def foo():
          pass
   def bar(baz,spam):
          whee()
<BLANKLINE>
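
Finally, since ``tokenize_file()`` accepts a filename, the same few lines give
you a whole-file reformatter. A sketch (``example.py`` is a hypothetical
filename, not a file shipped with SCALE)::

block = dsl.parse_block(dsl.tokenize_file("example.py"))
print reindent(block)   # the whole file, realigned to 4-space indents

And as noted earlier, the contents of multi-line strings are never shifted,
so the realigned code keeps exactly the same string values as the original.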