srcML is a tool for analyzing programming-language source code. It is unique in that it presents its results as an annotation of the original source code, using XML elements and attributes, in a lossless manner. As in compression, lossless here means that the original source code can be fully recovered, including its layout in the form of white space and indentation.
The provided annotation reflects a simple Abstract Syntax Tree (AST) of the "program". The program need not be a complete program in the sense of the syntax of the programming language; it may just as well be a well-formed code snippet, such as a declaration, a function definition, a block of statements, or an expression. Other parsers typically need to be told a grammar start symbol; with srcML this is not necessary, because srcML figures out by itself how to process the snippet. The programming languages supported by srcML are C, C++, C#, and Java. Each language has its own unique XML elements, together with a common shared set for similar language constructs. This is documented here.
The XML annotation is demonstrated here with a tiny C source code example:
[tiny.c]
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("args: %d\n", argc);
return 0;
}
The following command turns the C code into the annotated XML:
$ srcml tiny.c -o tiny.xml
This generates the following XML. Note that we manually edited the result so that long lines fit on the page without truncation. (Of course, this destroys the preserved layout of the original C source.)
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.srcML.org/srcML/src"
xmlns:cpp="http://www.srcML.org/srcML/cpp"
revision="0.9.5" language="C" filename="tiny.c">
<cpp:include>#<cpp:directive>include</cpp:directive>
<cpp:file>&lt;stdio.h&gt;</cpp:file>
</cpp:include>
<function><type><name>int</name></type> <name>main</name>
<parameter_list>(<parameter><decl><type><name>int</name></type>
<name>argc</name></decl></parameter>,
<parameter><decl><type><name>char</name> <modifier>*</modifier></type>
<name><name>argv</name><index>[]</index></name></decl></parameter>
)</parameter_list>
<block>{
<expr_stmt><expr><call><name>printf</name>
<argument_list>(
<argument><expr><literal type="string">"args: %d\n"</literal></expr>
</argument>,
<argument><expr><name>argc</name></expr></argument>
)</argument_list></call></expr>;</expr_stmt>
<return>return <expr><literal type="number">0</literal></expr>;</return>
}</block></function>
</unit>
To restore the original C code, run this:
$ srcml tiny.xml
To remove the enclosing <unit> element, use the --output-srcml-inner option.
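The lossless property can also be seen without srcML itself: stripping all markup from the XML and concatenating the remaining text nodes gives back the source. The following is a minimal Python sketch of that idea using only the standard library. The `SRCML` string is a hand-written, simplified stand-in for real srcml output, and `srcml_to_source` is a hypothetical helper name, not part of srcML.

```python
import xml.etree.ElementTree as ET

# The C source we expect to get back, including its white space.
SOURCE = "int main() {\n  return 0;\n}\n"

# Hand-written, simplified srcML for SOURCE (an assumption for illustration;
# real srcml output carries more attributes on the <unit> element).
SRCML = (
    '<unit xmlns="http://www.srcML.org/srcML/src" language="C">'
    "<function><type><name>int</name></type> <name>main</name>"
    "<parameter_list>()</parameter_list> <block>{\n"
    '  <return>return <expr><literal type="number">0</literal></expr>;</return>\n'
    "}</block></function>\n"
    "</unit>"
)


def srcml_to_source(xml_text: str) -> str:
    """Concatenate every text node in document order, dropping all markup."""
    return "".join(ET.fromstring(xml_text).itertext())


if __name__ == "__main__":
    print(srcml_to_source(SRCML), end="")
```

Because the XML parser decodes entities such as `&lt;` while reading, the recovered text contains the original characters rather than their escaped forms.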
To render the output XML in a nicely formatted form, you can pipe it through this filter:
| tidy -xml -i -q -
(Obviously, you will lose the original white-space layout.)
There are various tools to render the parse tree as an image. A simple one is DrawTag, which accepts any XML as input. See figure 1 for a visualization of the tiny.xml file.
The srcML program has many options:
GENERAL OPTIONS:
-h [ --help ] arg display this help and exit. USAGE: help or
help [module name]. MODULES: src2srcml,
srcml2src
-V [ --version ] display version number and exit
-v [ --verbose ] conversion and status information to stderr
-q [ --quiet ] suppress status messages
--list list all files in the srcML archive and
exit
-i [ --info ] display most metadata except srcML file
count and exit
-L [ --longinfo ] display all metadata including srcML file
count and exit
--max-threads arg (=4) set the maximum number of threads srcml can
spawn
-o [ --output ] arg (=stdout://-) write ouput to a file
CREATING SRCML:
-l [ --language ] arg set the language to C, C++, or Java
--register-ext arg register file extension EXT for
source-code language LANG. arg format
EXT=LANG
--src-encoding arg set the input source encoding
--files-from arg read list of source file names to form a
srcML archive
-X [ --output-xml ] output in XML instead of text
-r [ --archive ] store output in a srcML archive, default
for multiple input files
--in-order enable strict output ordering
-t [ --text ] arg raw string text to be processed
MARKUP OPTIONS:
--position include line/column attributes, namespace
'http://www.srcML.org/srcML/position'
--tabs [=arg(=8)] set tabs arg characters apart. Default
is 8
--cpp enable preprocessor parsing and markup
for Java and non-C/C++ languages
--cpp-markup-if0 markup cpp #if 0 regions
--cpp-nomarkup-else leave cpp #else regions as text
XML FORM:
-x [ --xml-encoding ] arg (=UTF-8) set output XML encoding. Default is UTF-8
--no-xml-declaration do not output the XML declaration
--no-namespace-decl do not output any namespace declarations
--xmlns arg set the default namespace to arg
--xmlns: arg set the namespace. arg format PREFIX=URI
METADATA OPTIONS:
-f [ --filename ] arg set the filename attribute
--url arg set the url attribute
-s [ --src-version ] arg set the version attribute
--hash add hash to srcml output
--timestamp add timestamp to srcml output
-p [ --prefix ] arg display prefix of namespace given by URI
arg and exit
--show-language display source language and exit
--show-url display source url name and exit
--show-filename display source filename and exit
--show-src-version display source version and exit
--show-timestamp display timestamp and exit
--show-hash display hash and exit
--show-encoding display xml encoding and exit
--show-unit-count display number of srcML files and exit
EXTRACTING SOURCE CODE:
-S [ --output-src ] output in text instead of XML
--to-dir arg extract all files from srcML and create them in the
filesystem
TRANSFORMATIONS:
--apply-root apply an xslt program or xpath query to the root
element
--relaxng arg output individual units that match RELAXNG file or URI
--xpath arg apply XPATH expression to each individual unit
--xslt arg apply XSLT file or URI transformation to each
individual unit
--attribute arg add attribute to xpath query
--element arg add element to xpath query
-U [ --unit ] arg extract individual unit number from srcML
For instance, the XML markup can be enhanced with line and column coordinates (--position). Note also that srcML has built-in capabilities to query and manipulate the XML: queries can be written as XPath expressions, and general transformations can be executed with XSLT.
Here are a few examples of useful operations:
- Get all function and method definition names:
$ srcml --xpath="//src:function/src:name" program.xml
- Count the number of conditions:
$ srcml --xpath='count(//src:condition)' program.xml
- Output all line comments:
$ srcml --xpath='//src:comment[@type="line"]' program.xml
Much more versatile and powerful tools to process any XML are xidel and xmlstarlet.
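The same three queries can be approximated in a general-purpose language. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (there is no `count()`, so the matches are counted with `len`). The `SRCML` sample is hand-written and simplified, and the helper names `function_names`, `count_conditions`, and `line_comments` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the URI is the srcML namespace shown in the <unit> element.
NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written, simplified srcML (an assumption for illustration only).
SRCML = """<unit xmlns="http://www.srcML.org/srcML/src" language="C">
<comment type="line">// helper</comment>
<function><type><name>int</name></type> <name>helper</name><parameter_list>(<parameter><decl><type><name>int</name></type> <name>x</name></decl></parameter>)</parameter_list> <block>{
  <if>if <condition>(<expr><name>x</name></expr>)</condition><then> <return>return <expr><name>x</name></expr>;</return></then></if>
  <return>return <expr><literal type="number">0</literal></expr>;</return>
}</block></function>
<function><type><name>int</name></type> <name>main</name><parameter_list>()</parameter_list> <block>{
  <return>return <expr><call><name>helper</name><argument_list>(<argument><expr><literal type="number">2</literal></expr></argument>)</argument_list></call></expr>;</return>
}</block></function>
</unit>"""


def function_names(root):
    """Names that are direct children of a <function> element."""
    return [n.text for n in root.findall(".//src:function/src:name", NS)]


def count_conditions(root):
    """Number of <condition> elements anywhere in the tree."""
    return len(root.findall(".//src:condition", NS))


def line_comments(root):
    """Text of all comments marked type="line"."""
    return [c.text for c in root.findall(".//src:comment[@type='line']", NS)]


if __name__ == "__main__":
    tree = ET.fromstring(SRCML)
    print(function_names(tree))
    print(count_conditions(tree))
    print(line_comments(tree))
```

Note that the queries must be namespace-qualified (`src:`), just as in the srcml command lines above, because all srcML elements live in the srcML namespace.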
If you prefer JSON over XML, use jtm to convert the srcML output:
jtm -i2 tiny.xml > tiny.json
A neat application of the combination of srcML and xmlstarlet is a script that produces the call graph of a program as a JSON-Graph. From there it is easy to extract the actual function definitions, i.e., the source text, of all functions mentioned in the call graph.
Here we briefly sketch the main steps. Details can be found in the actual Bash scripts provided in this directory.
1. We start by using srcML to get all function definitions from a source file. (This ignores any preprocessor directives, global variables, and typedefs.)
   srcml --xpath="//src:function" source.c -o source.xml
2. Given the NAME of a function, we then look up its definition in the source.xml file and use xmlstarlet to retrieve all names mentioned in function calls in the body of that function definition:
   xmlstarlet sel -t -v "//function[name=\"NAME\"]//call/name" source.xml
3. Using the capability of step 2, and starting from a root function name that is supplied by a command-line argument or defaults to main, we build a graph whose nodes represent functions and whose directed edges represent function calls.
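The three steps above can be sketched in Python as follows. This is not the Bash implementation from this directory, only an illustration of the same idea under stated assumptions: the `SRCML` sample is hand-written and stripped down to the elements the queries need, calls to functions that are not defined in the file (such as printf) are left out of the graph, and the JSON layout is only loosely modelled on a JSON graph with `nodes` and `edges`; the actual scripts may differ on all of these points. All function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from collections import deque

NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written, stripped-down srcML (assumption): only <function>, <name>,
# <block>, and <call> are kept.
SRCML = """<unit xmlns="http://www.srcML.org/srcML/src" language="C">
<function><name>main</name><block><call><name>parse</name></call><call><name>report</name></call><call><name>printf</name></call></block></function>
<function><name>parse</name><block><call><name>parse</name></call><call><name>helper</name></call></block></function>
<function><name>report</name><block><call><name>printf</name></call></block></function>
<function><name>helper</name><block></block></function>
<function><name>orphan</name><block><call><name>helper</name></call></block></function>
</unit>"""


def callees_by_function(xml_text):
    """Steps 1 and 2: map each defined function to the names it calls."""
    table = {}
    for fn in ET.fromstring(xml_text).findall(".//src:function", NS):
        name = fn.find("src:name", NS)
        if name is None or not name.text:
            continue
        called = []
        for call_name in fn.findall(".//src:call/src:name", NS):
            # Compound names have no direct text; this sketch skips them.
            if call_name.text and call_name.text not in called:
                called.append(call_name.text)
        table[name.text] = called
    return table


def build_call_graph(table, root_name="main"):
    """Step 3: breadth-first walk from root_name over defined functions."""
    nodes, edges = [], []
    seen = {root_name}
    queue = deque([root_name])
    while queue:
        caller = queue.popleft()
        nodes.append(caller)
        for callee in table.get(caller, []):
            if callee not in table:
                continue  # not defined in this file, e.g. printf
            edges.append((caller, callee))
            if callee not in seen:  # the seen set keeps cycles from looping
                seen.add(callee)
                queue.append(callee)
    return nodes, edges


def to_json_graph(nodes, edges):
    """Serialize with an assumed nodes/edges layout."""
    graph = {
        "directed": True,
        "nodes": {n: {"label": n} for n in nodes},
        "edges": [{"source": s, "target": t} for s, t in edges],
    }
    return json.dumps({"graph": graph}, indent=2)


if __name__ == "__main__":
    found_nodes, found_edges = build_call_graph(callees_by_function(SRCML))
    print(to_json_graph(found_nodes, found_edges))
```

Functions that are never reached from the root, such as `orphan` in the sample, do not appear in the graph.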
Note that the call graph can have cycles. These are caused by directly or mutually recursive functions.
Once we have the graph, traversing it in reverse depth-first order, so that callees come before their callers, enumerates all function definitions in the proper define-before-use order. The definitions can then be retrieved in that order from the source.xml file, again using srcML to convert them back to source code.
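One way to read "reverse depth-first order" is a depth-first post-order traversal, in which a function is emitted only after all of its callees. The sketch below illustrates that reading in Python; the `GRAPH` data and the name `define_before_use_order` are hypothetical, and the actual scripts may traverse the graph differently. A visited set keeps the traversal from looping on cycles.

```python
# Hypothetical call graph: caller -> list of callees.
GRAPH = {
    "main": ["parse", "report"],
    "parse": ["expr", "helper"],
    "expr": ["parse"],  # parse and expr are mutually recursive
    "report": ["helper"],
    "helper": [],
}


def define_before_use_order(graph, root="main"):
    """Return function names in depth-first post-order from root."""
    order = []
    visited = set()

    def visit(node):
        if node in visited:
            return  # already handled, or currently on the stack (a cycle)
        visited.add(node)
        for callee in graph.get(node, []):
            visit(callee)
        order.append(node)  # emitted only after all of its callees

    visit(root)
    return order


if __name__ == "__main__":
    print(define_before_use_order(GRAPH))
```

For the sample graph the result is `['expr', 'helper', 'parse', 'report', 'main']`. Every callee precedes its caller except along the edge that closes a cycle (here `expr` calls `parse`, which is emitted later); in C such a call would still need a prior function prototype.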
