Inspired by Ohloh, and by the need to start scoping the migration of a large cluster of Drupal sites from Drupal 4.7 to Drupal 5.x/6.x, I've started work on a static analysis / code metrics tool
specifically intended for PHP / Drupal.
Code is available from Bryght's public svn repository, with repo URL https://svn.bryght.com/dev/svn/scripts/metrics.
The metrics.php program is currently implemented as a command line
script, which is pointed at a directory containing code to be
analyzed:
% php metrics.php code_dir
Reports are generated on standard output.
History
This code started from the sloccount.php script written by Arto
Bendiken. I've tried to extend the script by making it more
comprehensive and precise, but also more flexible and general. The
architecture is intended to be pluggable, allowing users to easily
write analysis tools for new types of files.
Currently there are two supported file types: generic, and PHP.
The 'generic' code analysis supports 'C' style code, with in-line
comments initiated by '//' and comment blocks set off by '/*' and '*/'.
This gives rough metrics for JavaScript and CSS files.
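To make the 'generic' approach concrete, here is a minimal sketch of a C-style line classifier. It is written in Python for brevity and is not the metrics.php code. The function name `count_c_style` and the three categories are assumptions made for illustration. It deliberately ignores comment markers inside string literals, and it counts a line with code followed by a trailing comment as code.

```python
def count_c_style(source: str) -> dict:
    """Classify each line of C-style source as code, comment, or blank.

    Hypothetical helper for illustration; a rough line-based scan,
    not a real parser.
    """
    counts = {"code": 0, "comment": 0, "blank": 0}
    in_block = False  # True while inside an unterminated /* ... */ block
    for raw in source.splitlines():
        line = raw.strip()
        if in_block:
            counts["comment"] += 1
            if "*/" in line:
                in_block = False
            continue
        if not line:
            counts["blank"] += 1
        elif line.startswith("//"):
            counts["comment"] += 1
        elif line.startswith("/*"):
            counts["comment"] += 1
            # The block continues on later lines unless it closes here.
            in_block = "*/" not in line[2:]
        else:
            counts["code"] += 1
    return counts
```

Because it only looks at '//', '/*' and '*/', the same scan gives usable (if rough) numbers for JavaScript and CSS alike.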
The PHP analyzer uses PHP's tokenizer and a rudimentary parser to
obtain additional information about PHP code.
Analysis is completely separated from reporting: the analysis
phase builds an array of metrics per file, where each metric
can itself be structured.
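The separation of the two phases can be sketched as follows. This is an illustrative Python outline, not the actual metrics.php data layout: the function names `analyze` and `report_totals` and the metric keys are all assumptions. The point is only that the reporter works from the per-file metrics array and never touches the source files.

```python
def analyze(files: dict) -> dict:
    """Analysis phase: map each file name to a dict of metrics.

    `files` maps a file name to its text (hypothetical input shape).
    """
    metrics = {}
    for name, text in files.items():
        lines = text.splitlines()
        metrics[name] = {
            "lines": len(lines),
            # A structured metric: a nested dict rather than a single number.
            "line_lengths": {
                "max": max((len(l) for l in lines), default=0),
                "total": sum(len(l) for l in lines),
            },
        }
    return metrics


def report_totals(metrics: dict) -> str:
    """Reporting phase: reads only the metrics built by analyze()."""
    total_lines = sum(m["lines"] for m in metrics.values())
    return f"{len(metrics)} files, {total_lines} lines"
```

With this split, a new report (say, a per-file table) is just another function over the same metrics, with no change to the analysis code.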
Current reporting includes basic statistics about the number of lines, comments (inline and doc style), functions, and tokens (identified by class, e.g., control, operator, ...). It also includes listings of variables and constants (with the lines where they are used), functions and the lines where they are defined, a basic CoCoMo estimate, and a frequency distribution for PHP tokens.
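For readers unfamiliar with the cost estimate, the basic COCOMO 'organic' model computes effort in person-months as 2.4 × KLOC^1.05, where KLOC is thousands of source lines. The sketch below applies that published formula in Python. How metrics.php counts source lines, and whether it applies any further factors, is not stated here, so the function name `cocomo_organic_cost` and its parameters are assumptions for illustration only.

```python
def cocomo_organic_cost(sloc: int, salary_per_year: float = 60000.0) -> float:
    """Estimate development cost with the basic COCOMO 'organic' model.

    Hypothetical helper: effort (person-months) = 2.4 * KLOC ** 1.05,
    converted to cost at a flat yearly salary per developer.
    """
    kloc = sloc / 1000.0
    person_months = 2.4 * kloc ** 1.05
    return person_months / 12.0 * salary_per_year
```

As a sanity check, 1,000 source lines come to 2.4 person-months, or about $12,000 at $60,000 per year. The exponent above 1 means the estimate grows slightly faster than the line count.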
Future work
This is currently very rough. Now that I've written an initial version,
I can see that it needs to be completely re-done.
Currently the code is extensible to handle different file types -
though the details of this need to be made more sane. However, really
the analysis and reporting should be object oriented to take advantage
of inheritance - there should be a Drupal analyzer, which extends the
PHP analyzer, which in turn extends the generic analyzer.
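The inheritance chain described above might look roughly like this. It is a Python sketch of the design, not existing code: the class names, the regular expression standing in for PHP's tokenizer, and the three-entry hook list are all assumptions made to keep the example short.

```python
import re


class GenericAnalyzer:
    """Base analyzer: metrics that make sense for any text file."""

    def analyze(self, text: str) -> dict:
        return {"lines": len(text.splitlines())}


class PhpAnalyzer(GenericAnalyzer):
    """Adds PHP-specific metrics on top of the generic ones."""

    # Rough stand-in for a tokenizer: matches `function name(`.
    FUNCTION_RE = re.compile(r"\bfunction\s+&?\s*([A-Za-z_]\w*)\s*\(")

    def analyze(self, text: str) -> dict:
        metrics = super().analyze(text)
        metrics["functions"] = self.FUNCTION_RE.findall(text)
        return metrics


class DrupalAnalyzer(PhpAnalyzer):
    """Adds Drupal-specific metrics, e.g. which hooks a module implements."""

    HOOKS = ("menu", "perm", "nodeapi")  # small illustrative subset

    def __init__(self, module: str):
        self.module = module

    def analyze(self, text: str) -> dict:
        metrics = super().analyze(text)
        prefix = self.module + "_"
        metrics["hooks"] = [
            name for name in metrics["functions"]
            if name.startswith(prefix) and name[len(prefix):] in self.HOOKS
        ]
        return metrics
```

Each level calls its parent's `analyze()` and only adds its own keys, so a report written against the generic metrics keeps working on PHP and Drupal results.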
There's lots of room for creativity in terms of packaging this,
handling more file types, displaying output, and building more
meaningful models.
An easy project (at least the first 80%) would be to write a CSS
tokenizer / parser.
One interest I would like to pursue is cost models that are
statistically based, with parameters computed from analyzing
known costs of real-world code.
Measuring complexity based on factors such as Boolean and arithmetic
expressions, recursion, coverage of particular parts of the API (e.g.,
complexity of SQL statements) - all of these are possible extensions.
Another is Drupal-specific analysis, which would be able to look at
the number of hooks implemented, the version of Drupal which code is
written to, and the number of core vs. contrib vs. non-Drupal functions
called per code unit.
An interesting project would be to develop a cost model for updating a
particular site to a given version of Drupal, based on API changes
and estimates of the amount of code that needs to be changed.
Selection and automatic generation of test cases (unit test stubs)
might also be an application.
Feedback and contributions are very welcome!
See attached text file for an example of (simplified) output of metrics.php run on the CVS checkout of the contributions repository DRUPAL-4-7 tag on Nov 9, 2007:
- 2990 PHP files
- 267 javascript files
- 495 css files
- 632 other ('interesting') files
for a total of 4,384 files analyzed.
45,248 KB of code in files considered, containing 874,838 lines of code and 23,790 function definitions.
Total CoCoMo 'organic' model cost: $10,245,457.00 at $60,000 per year per developer.