Defining a Custom LOC Counter

In the PSP, lines of code (LOC) are often used to measure the size of a software component. To produce a size measurement that is useful for planning and analysis purposes, the PSP recommends that you separately account for base, added, modified, and deleted LOC. In addition, it recommends that you exclude comment lines, blank lines, and automatically generated code from the count.

Although many free and commercial LOC counters are available, they typically only measure total LOC, and they often include comment lines in the counts they produce. For this reason, the dashboard provides a simple LOC counting tool that developers can use to measure base, added, modified, and deleted LOC in a PSP-recommended way.

To properly exclude comments from the count, this LOC counting tool must have a rudimentary understanding of the programming language syntax. Accordingly, the LOC counting tool is prepopulated with several language filters that recognize the syntax of many popular programming languages. The LOC counter can automatically handle files of the following types:

Of course, many other source code languages exist, and you may find yourself needing to count LOC for a language not handled above. Fortunately, the dashboard allows language filters to be defined and loaded dynamically, so you can create your own custom language filters for use with the dashboard's LOC counter.

To create a custom language filter, follow these steps:

  1. Decide what lines of code should be counted
  2. Create an XML definition for your language filter
  3. Save your XML language filter definition
  4. Restart the dashboard

Step 1: Decide what lines of code should be counted

In the PSP, we seek to identify size measures that correlate well to development effort. While LOC is not perfect, we nevertheless want to craft our LOC counting standard so it supports this goal.

The PSP makes several recommendations to help us get started. For example, it recommends that we do not count comments, blank lines, or autogenerated code.

Following this advice, the dashboard's LOC counter automatically ignores lines that contain only whitespace. In Step 2(c) below, you will tell the LOC counter about the comment syntax for your programming language, so it can exclude comments from the count as well.

Autogenerated code can be trickier, but occasionally you may find that a particular type of autogenerated code follows a simple pattern and can be excluded. For example, it is common for Java IDEs to automatically manage the import statements at the top of the file; if you regularly use that IDE feature, you might choose to exclude import statements from the LOC count.

In addition, be aware that autogenerated code may be present in your source code but invisible to you! Some IDEs (for example, Visual Studio) have been known to attach large headers or footers to a source file, but not display those headers/footers when the file is opened in the IDE. Use a plain text editor like Notepad or Emacs to open several typical source files and see if the contents match your expectations.

Finally, the PSP makes a blanket recommendation: if you aren't certain whether a particular line should be counted, try running your metrics both ways, then see which approach gives you a better correlation between size and effort.

As an example, this technique was used by the dashboard development team to determine that curly braces appearing on their own line usually should not be included in LOC counts of C-style languages. When you think about this, it becomes clear why. Consider the following three code fragments:

  if (test)
    then-clause;
  else
    else-clause;

  if (test) {
    then-clause;
  } else {
    else-clause;
  }

  if (test)
  {
    then-clause;
  }
  else
  {
    else-clause;
  }

For the purposes of this discussion, we aren't concerned about which of these fragments is "superior." We just want to think about how we should count the LOC they contain.

As written, the three fragments are semantically equivalent, differing only in the appearance and placement of the curly braces. The bulk of the programming effort will be invested in writing the test expression and the two conditional clauses; thus, we would expect these three fragments of code to require virtually identical effort to produce from scratch. To produce the best correlation between size and effort, we would ideally like them to have the same size measurement.

If we were to count every nonblank line as a line of code, then the third code fragment (8 lines) would allegedly be twice the size of the first (4 lines), and would allegedly require twice as much effort to produce. That is counterintuitive and can be rejected straightaway. However, if we simply ignore any curly brace that appears on a line by itself, we will receive an identical LOC count of 4 from all three fragments. In this way, ignoring lone curly braces allows us to correct for varying whitespace styles across a large code base, improving the quality of our size measurements.

Once you have decided which lines of code you wish to count, you're ready for the next step.

Step 2: Create an XML definition for your language filter

The dashboard supports a simple declarative XML syntax for defining language filters. The discussion below assumes that you are familiar with XML; if you are not, please read an introductory XML tutorial before proceeding.

Here is an example of an XML description file for the C-style language filter:

<?xml version='1.0'?>

<dashboard-process-template>

   <locFilter id="C" fileSuffixes=".c .cpp .c++ .h .java .cs">
      <commentSyntax beginsWith="//" />
      <commentSyntax beginsWith="/*" endsWith="*/" />

      <stringSyntax delimiter="&quot;" escapeChar="\" />
      <stringSyntax delimiter="&apos;" escapeChar="\" />
      <stringSyntax beginsWith="@&quot;" endsWith="&quot;" mayInclude="&quot;&quot;" />

      <possibleFirstLine>#include</possibleFirstLine>
      <possibleFirstLine>#define</possibleFirstLine>
      <possibleFirstLine>#if</possibleFirstLine>
      <possibleFirstLine>#pragma</possibleFirstLine>

      <ignoreLine equalTo="{" />
      <ignoreLine equalTo="}" />
   </locFilter>

</dashboard-process-template>

The file begins and ends with a fixed header and footer: the XML declaration and the opening <dashboard-process-template> tag at the top, and the closing </dashboard-process-template> tag at the bottom. Between the header and footer, any number of LOC filters can be defined through the use of the <locFilter> element (please note that capitalization is significant). If you are also creating custom process definitions, you can combine custom process <template> definitions and custom <locFilter> definitions in the same XML description file, mixing the tags in any order.

Step 2(a): Assign a Unique ID

Each LOC filter must have a unique identifier, which is specified in the id attribute of the locFilter tag. This identifier will be displayed to end users in the LOC report, so it is best to choose a human-readable string. If you reuse the ID of a LOC filter built into the dashboard (i.e., C, Sh, Pascal, SQL, Cobol, or Default), your definition will replace the built-in filter.

Step 2(b): List Typical Filename Suffixes

When you run a LOC comparison report, the dashboard looks at each file being compared and attempts to determine which language filter is most appropriate, a process called "LOC Filter Selection." This process involves several steps, but the first step is to look at the filename. You can use the fileSuffixes attribute to list the filename suffixes (separated by spaces) that are commonly used for source code files in the programming language you are describing.

Step 2(c): Define Comment Syntax

For the purposes of LOC counting, comment syntax is the most important difference between the various language filters. You can declare any number of comment styles for a language filter by creating embedded <commentSyntax> elements.

Each <commentSyntax> element must have a beginsWith attribute, indicating the string of characters that begins a comment. An endsWith attribute can also be specified, indicating the sequence of characters that ends the comment. If no endsWith attribute is given, the "end-of-line" character will be assumed; thus, single-line comment styles can be specified simply by a beginsWith attribute.

If your programming language has blocks of auto-generated code that begin with a distinctive sequence of characters, you may be able to use a <commentSyntax> element to describe them. Any matching blocks will be excluded from LOC counts along with the other comments in the file.

Technically, you could choose not to list any <commentSyntax> elements. The resulting LOC filter would describe a language without comments, or alternatively, a language where comments are included in LOC counts. The "Default" filter is a preinstalled example of this.
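
As an illustration, the following sketch defines a filter for a language with Haskell-style comments, where "--" starts a single-line comment and "{-" ... "-}" encloses a block comment. The id "Haskell" and the ".hs" suffix are assumptions made for this example, and the sketch does not attempt to handle nested block comments.

```xml
<?xml version='1.0'?>

<dashboard-process-template>

   <locFilter id="Haskell" fileSuffixes=".hs">
      <!-- No endsWith attribute: the comment runs to the end of the line -->
      <commentSyntax beginsWith="--" />
      <!-- Block comment with explicit start and end markers -->
      <commentSyntax beginsWith="{-" endsWith="-}" />
   </locFilter>

</dashboard-process-template>
```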

Step 2(d): Define String Literal Syntax

The LOC counter is primarily focused on identifying comments so those can be excluded from the count. However, there are occasions when a comment indicator can appear inside a string literal. Here is an example taken from a C program:

 
  printf("Visit http://example.com/ for more information");

In this line of code, the "//" that appears inside the string literal could be mistaken for the start of a single-line comment, which would prevent any subsequent changes on the line from being counted toward the "modified" count. To avoid this type of mistake, you can tell the LOC counter about the syntax for string literals in your programming language.

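You can declare any number of string literal styles for a language filter by creating embedded <stringSyntax> elements. As the C-style example above illustrates, each <stringSyntax> element must have one of the following:

  - a delimiter attribute, indicating the character that both begins and ends a string literal
  - a beginsWith attribute and an endsWith attribute, indicating the sequences of characters that begin and end a string literal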

An escapeChar attribute can optionally be specified, indicating a character (such as \) that is used to escape the following character inside the string.

In addition, a mayInclude attribute can optionally be specified to indicate a sequence of characters that are treated specially within the string literal and which should not signal the end of the string. A common example would be doubled-up delimiter characters, which occur in languages like Visual Basic. If your language allows the newline character to appear within a string literal, you can specify "\n" for this attribute; otherwise the LOC counter will treat the end-of-line as the end of an improperly formed string constant.

These attributes provide support for the vast majority of string literal constructs in the most popular languages. Some languages (for example, Perl) provide very flexible string quoting mechanisms that would become extremely complex to recognize without implementing a parser. If you use a language whose string rules cannot be expressed with this XML syntax, keep in mind that the LOC counter will not be able to detect comment indicators that appear within the more complex string constructs. Fortunately, this is fairly rare; however, you should be wary of comment indicators that appear inside string literals, and adjust your LOC counts if you know these are present.
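
The sketch below shows how these attributes might be combined. It describes a purely hypothetical language (the id "MyLang", the ".mylang" suffix, and all of the string and comment rules are invented for illustration) that uses "#" for comments, double-quoted strings with doubled quotes, backtick strings that may span lines, and single-quoted strings with backslash escapes.

```xml
<?xml version='1.0'?>

<dashboard-process-template>

   <locFilter id="MyLang" fileSuffixes=".mylang">
      <commentSyntax beginsWith="#" />

      <!-- Double-quoted string; a doubled quote inside it is literal text -->
      <stringSyntax beginsWith="&quot;" endsWith="&quot;" mayInclude="&quot;&quot;" />
      <!-- Backtick string that is allowed to span several lines -->
      <stringSyntax beginsWith="`" endsWith="`" mayInclude="\n" />
      <!-- Single-quoted string with backslash escapes -->
      <stringSyntax delimiter="&apos;" escapeChar="\" />
   </locFilter>

</dashboard-process-template>
```

With a filter along these lines, a "#" that appears inside any of the three string styles would be expected to be treated as part of the string rather than as the start of a comment.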

Step 2(e): List Possible First Lines (Optional)

During the "LOC Filter Selection" process, if a file is encountered whose filename doesn't match any recognized suffix, the dashboard will look inside the file in an attempt to guess which language it contains.

Since it is very common for source code files to begin with a descriptive comment, the dashboard will first check to see if the first non-whitespace characters in the file appear to match any known comment syntax (specified via a <commentSyntax> element as described above).

If a particular file under comparison does not have a recognizable file suffix and does not begin with a recognizable comment, the "LOC Filter Selection" process might still be stumped. In that case, you can optionally list other strings that could legally appear at the beginning of a source code file. This is done by including embedded <possibleFirstLine> elements. If the first line of the file begins with one of these strings, this LOC filter may be considered a match.

<possibleFirstLine> elements are completely optional; most language filters will not need to use them. In particular, remember that these are only used as a last resort; normally files will be categorized based on their filename suffix.
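
As a sketch, a filter for Haskell-style source files might list the keywords that commonly open such a file. The id, suffix, comment markers, and the two first-line strings are assumptions chosen for this example.

```xml
<?xml version='1.0'?>

<dashboard-process-template>

   <locFilter id="Haskell" fileSuffixes=".hs">
      <commentSyntax beginsWith="--" />
      <commentSyntax beginsWith="{-" endsWith="-}" />

      <!-- Consulted only when the suffix and leading comment give no answer -->
      <possibleFirstLine>module</possibleFirstLine>
      <possibleFirstLine>import</possibleFirstLine>
   </locFilter>

</dashboard-process-template>
```

Note that a string such as "import" can also begin files written in other languages, so a match here is only a hint that this filter may apply.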

Step 2(f): List Ignored Lines (Optional)

In addition to comments, there may be other lines of code that you wish to exclude from LOC counts. These can be specified by creating embedded <ignoreLine> elements. Each <ignoreLine> element must have exactly one of the following attributes:

The lines of code in the file will be matched against these patterns; if any pattern matches, the line will not be included in LOC counts. Keep in mind that lines are tested one at a time; therefore, multiline patterns are not supported.
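
For instance, a team that also wants to disregard lines consisting only of "};" (such as the close of a class definition or initializer in C++) might redefine the C-style filter as sketched below. Because it reuses the id "C", this definition would replace the built-in filter, so the other elements are repeated from the example in Step 2. The "};" rule is an illustrative assumption, not a dashboard default, and the sketch assumes that equalTo disregards surrounding whitespace, as the lone-brace rules suggest.

```xml
<?xml version='1.0'?>

<dashboard-process-template>

   <!-- Reusing the id "C" replaces the built-in C-style filter -->
   <locFilter id="C" fileSuffixes=".c .cpp .c++ .h .java .cs">
      <commentSyntax beginsWith="//" />
      <commentSyntax beginsWith="/*" endsWith="*/" />

      <stringSyntax delimiter="&quot;" escapeChar="\" />
      <stringSyntax delimiter="&apos;" escapeChar="\" />
      <stringSyntax beginsWith="@&quot;" endsWith="&quot;" mayInclude="&quot;&quot;" />

      <possibleFirstLine>#include</possibleFirstLine>
      <possibleFirstLine>#define</possibleFirstLine>
      <possibleFirstLine>#if</possibleFirstLine>
      <possibleFirstLine>#pragma</possibleFirstLine>

      <ignoreLine equalTo="{" />
      <ignoreLine equalTo="}" />
      <!-- Illustrative addition: a closing brace followed by a semicolon -->
      <ignoreLine equalTo="};" />
   </locFilter>

</dashboard-process-template>
```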

Step 3: Save your XML language filter definition

Once your XML language filter definition is complete, it is time to save it somewhere where the dashboard can find it. Follow these steps:

  1. Choose " → Help → About," then click the "Configuration" tab. Just above the list of add-ons, a paragraph will tell you where the Process Dashboard is installed. Find that directory on your computer. It should contain a small number of files, including one called "pspdash.jar".
  2. Underneath this directory, create a subdirectory called "Templates".
  3. Save your XML language filter definition in that Templates directory. Important: you must give your XML language filter definition a filename ending with "-template.xml" to signal to the dashboard that the file contains dashboard metadata. If you do not choose a filename ending with "-template.xml", your language filter definition will be ignored.

This is the simplest place to put your XML language filter definition. Of course, if you want to share your LOC counting filter with team members, you may want to put it somewhere else. See the section on how the dashboard finds process files for more information. Keep in mind that the dashboard will search for "-template.xml" files in the "Templates" directories only; subdirectories will not be searched.

Step 4: Restart the dashboard

Once your XML language filter definition is in place, it is necessary to shut down the dashboard and restart it for the new filter to be found. Once found, it will automatically be included in the "LOC Filter Selection" process when a LOC report is run. If you make changes to your LOC filter definition, a dashboard restart will also be required for your changes to take effect.