JCols

Updates as of 2011-06-04

  • 0.9.8 - Adjusted the class path so that JCols is more likely to find the required rhino.jar file.
  • 0.9.9 - Added the "-f" option.

The following documentation is very similar to the documentation that can be found in the README file in the JAR file.

Overview

Skip forward to the download section for instructions on downloading JCols if you'd rather not read this.

This utility parses text files by applying a user-specified expression to each line. The line may first be split into columns by whitespace (the default), by a delimiter regular expression (the -F option), or by the groups of a regular expression (the -r option). The name "JCols" is simply an abbreviation of "Java Columns".

This utility is an alternative to tools like AWK. It only has a command line interface. If you're interested in an application with a GUI to parse text files, consider something like FlexText or File Parse, although I have not tried either of those programs. If you want a Java implementation of AWK, look into jawk. Finally, in some cases the cut command, or even bash/sh's set $line to extract fields from a line, may be sufficient.

The distinguishing characteristics of JCols are that it is a small utility that allows arbitrary expressions in relatively familiar languages (Java and JavaScript) to be specified. Also, JCols' -r option, which allows arbitrary regular expression groups to be mapped to columns, does not seem to have an analogue in AWK.
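
The difference between splitting on a delimiter and mapping regular expression groups to columns can be sketched in plain Java. This is only an illustration, not the JCols source: the class and method names are hypothetical, and it assumes that the first match on the line supplies the groups and that a line with no match yields no columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JCols.
final class ColumnSketch {
    // Columns are the pieces between delimiter matches (default split or -F style).
    static List<String> bySplit(String line, String delimiterRegex) {
        return Arrays.asList(line.split(delimiterRegex));
    }

    // Columns are the capture groups of the first match (-r style).
    // Assumption: a line that does not match produces an empty column list.
    static List<String> byGroups(String line, String groupRegex) {
        Matcher m = Pattern.compile(groupRegex).matcher(line);
        List<String> columns = new ArrayList<>();
        if (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                columns.add(m.group(g));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = "aa1 bb1 cc1";
        System.out.println(bySplit(line, "\\s+"));             // [aa1, bb1, cc1]
        System.out.println(byGroups(line, "([a-z]+)(\\d+)"));  // [aa, 1]
    }
}
```

With groups, the columns need not be contiguous pieces of the line, which is what makes this style hard to express with a field separator alone.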

Part of the motivation behind this utility was to find the fastest way that a user-defined expression could be applied to each line in a file. Java was chosen as the host language due to its speed and portability. JavaScript was chosen as the embedded language due to its straightforward syntax and because it is well supported in Java. In some cases JCols has proven to be 20 times as fast as the equivalent Python utility evaluating Python expressions on each line.

This utility depends on Rhino for JavaScript support. Directly accessing Rhino instead of going through Java 1.6's generic interface was determined to be much faster.

Also, if the -j option is used, the user-specified expression is assumed to be Java instead of JavaScript. Java expressions are compiled with the compiler returned by javax.tools.ToolProvider.getSystemJavaCompiler(). After the roughly one second required to compile the Java expression, JCols is almost as fast as GNU AWK.
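
The general technique of compiling an expression at run time can be sketched as follows. This is not the JCols source. The class name LineExpr, the temporary directory prefix and the single parameter l are assumptions made for illustration, and the sketch uses java.util.function.Function, so it needs a newer JDK than the Java 6 minimum mentioned in the requirements.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical sketch: wrap an expression in a class, compile it once,
// then reuse the compiled object for every line.
final class CompileSketch {
    @SuppressWarnings("unchecked")
    static Function<String, Object> compile(String expression) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            // getSystemJavaCompiler() returns null when only a JRE is available.
            throw new IllegalStateException("No system Java compiler found; a JDK is required");
        }
        try {
            // The temporary directory is left in place, so the source can be inspected.
            Path dir = Files.createTempDirectory("compile-sketch");
            Path source = dir.resolve("LineExpr.java");
            String code =
                "public class LineExpr implements java.util.function.Function<String, Object> {\n"
                + "    public Object apply(String l) { return " + expression + "; }\n"
                + "}\n";
            Files.write(source, code.getBytes(StandardCharsets.UTF_8));

            // A zero status means the compilation succeeded.
            int status = compiler.run(null, null, null, "-d", dir.toString(), source.toString());
            if (status != 0) {
                throw new IllegalArgumentException("Expression did not compile: " + expression);
            }

            // Load the freshly compiled class from the temporary directory.
            URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() });
            Class<?> cls = loader.loadClass("LineExpr");
            return (Function<String, Object>) cls.getDeclaredConstructor().newInstance();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not compile or load the expression", e);
        }
    }

    public static void main(String[] args) {
        Function<String, Object> upper = compile("l.toUpperCase()");
        System.out.println(upper.apply("aa1 bb1 cc1")); // AA1 BB1 CC1
    }
}
```

The compile step is paid once per run. After that each line costs only an ordinary method call, which fits the start-up cost followed by near-AWK speed described above.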

A future version of JCols may attempt to sandbox the expressions by using something like policy files in Java. Until then don't run untrusted expressions.

Requirements

JCols requires Java 6 or later. The JDK, not just the JRE, is required to use Java expressions. Rhino is required to use JavaScript expressions. A future version of JCols may dynamically load Rhino if and only if a JavaScript expression is used. Until then, rhino.jar is always required. See the install section for help with getting rhino.jar to work.

Download

The latest JAR file may be downloaded by clicking here. The source code is included in the JAR file.

Install

Copy the downloaded JAR file to wherever you like to keep JAR files. Optionally rename the file to simply jcols.jar. Also, optionally make the JAR executable if your system supports executable JAR files. Finally, optionally make sure the executable JAR file is in your path. The source code as well as the Eclipse project file is included in the JAR file.

Since JCols depends on Rhino there are at least three different ways that you can make Rhino available to JCols.

  1. On Linux, simply install the rhino package so that /usr/share/java/rhino.jar exists; that path is included in the class path in the manifest of jcols.jar:
    > yum install rhino

    On other operating systems where this path is not correct, edit the manifest of jcols.jar, or better yet use one of the remaining options.

  2. Create a directory named jcols-lib in the same directory as jcols.jar and then place rhino.jar within that directory:
    > ls -1 jcols.jar
    jcols.jar
    > mkdir jcols-lib
    > cp rhino.jar jcols-lib
  3. Include rhino.jar within the ext directory of your JRE. Although this is not the recommended solution due to possible side effects it was once the way it worked in Fedora, and it still seems to work:
    > cp -i rhino.jar /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/ext

    Note the -i to prevent overwriting an existing rhino.jar. Also note that the location of the ext directory depends on the JRE used.

Syntax

jcols.jar should be used as a pipe (it takes text on standard input and writes text to standard output). Ideally the system supports executable JAR files, in which case jcols.jar can simply be invoked as shown in the examples. If not, replace jcols.jar in the examples with java -jar jcols.jar. The usage, which can be seen by running jcols.jar without any arguments, is:

Usage: jcols.jar [options] Java[Script]-expression
Options:
  --                  The remaining arguments are non-option arguments.
  -f input-file       Read from an input file instead of stdin.
  -F split-expression Regular expression used as a delimiter to split the line
                      (like IFS)
  -d                  For Java use a default of 0 for malformed numbers.
                      Slower.
  -h                  This help message
  -i                  Ignore input lines that cause exceptions.  Slower.
  -j                  Java expression instead of JavaScript
  -k                  Keep temporary files, if any
  -l                  For Java use 'long' instead of 'int'
  -n join-string      Join a comma separated list of strings with the specified
                      join-string
  -r group-expression Regular expression with groups for columns used to match
                      the columns
  -t                  Trim leading and trailing whitespace.  Slower.
  -v                  Verbose

The Java[Script]-expression is evaluated within a JavaScript function for JavaScript, or within a method in a generated Java class for Java. The function or method is called once for each line of input. Options may be combined (e.g. -jk instead of -j -k) and specified in any order. Options that take an argument, such as -F and -r, require whitespace between the option and the argument. The function or method has the following parameters:

  • n - The line number as an integer.
  • l - The line as a string.
  • c - The columns as an array of strings.

Additionally, the following shortcuts for accessing columns are supported (column numbers start at zero for the leftmost column):

  • a_<column number> - The running numeric total for the column specified (accumulator).
  • n_<column number> - The numeric value for the column specified.
  • s_<column number> - The string value for the column specified.
  • o[ans]_<column number> - The value of the specified column on the previous line.
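
The parameters and shortcuts above can be pictured as per-line state that is updated before each evaluation. The following is a rough sketch of that idea, not the generated code. The class, the method names, the fixed limit of 16 columns and the choice to skip non-numeric columns when accumulating are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical model of the values visible to an expression.
final class LineContextSketch {
    int n;                                       // line number, starting at 1
    String l;                                    // the whole line
    String[] c;                                  // columns of the current line
    private String[] oldC;                       // columns of the previous line
    private final long[] totals = new long[16];  // assumed column limit

    String s(int col)  { return c[col]; }                            // like s_<col>
    long num(int col)  { return Long.parseLong(c[col]); }            // like n_<col>
    long a(int col)    { return totals[col]; }                       // like a_<col>
    String os(int col) { return oldC == null ? null : oldC[col]; }   // like os_<col>

    // Move to the next line: remember the old columns, then update the totals.
    private void advance(String line) {
        oldC = c;
        n++;
        l = line;
        c = line.split("\\s+");
        for (int i = 0; i < c.length && i < totals.length; i++) {
            try {
                totals[i] += Long.parseLong(c[i]);
            } catch (NumberFormatException e) {
                // Non-numeric column: leave its total unchanged (an assumption).
            }
        }
    }

    // Evaluate the expression once per line and collect the results.
    static List<String> run(List<String> lines, Function<LineContextSketch, Object> expr) {
        LineContextSketch ctx = new LineContextSketch();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            ctx.advance(line);
            out.add(String.valueOf(expr.apply(ctx)));
        }
        return out;
    }
}
```

For the input lines "1 x", "2 y" and "3 z", the expression r -> r.n + ":" + r.s(1) + ":" + r.a(0) + ":" + r.os(1) yields 1:x:1:null, 2:y:3:x and 3:z:6:y. The running total includes the current line, and the old value is null on the first line.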

Examples

The following examples use a simple data file:

> cat some-table
aa1 bb1 cc1
aa2 bb2 cc2
aa3 bb3 cc3

Print out the line number and length of each line separated by a tab:

> jcols.jar 'n + "\t" + l.length' < some-table
1   11
2   11
3   11

Print out only the first and third columns with a ":" separating them. Convert the result to upper case. Also, specify an input file (-f) rather than reading from stdin:

> jcols.jar -f some-table '(s_0 + ":" + s_2).toUpperCase()'
AA1:CC1
AA2:CC2
AA3:CC3

Note: The previous example would also have worked with the -j option (Java expression instead of JavaScript), since the expression is valid in both Java and JavaScript.

For the first two columns concatenate the column with its old value. Note that "null" is used for the first row when there is no old value. Also note that any column reference may be prefixed with "o" to get the old value:

> jcols.jar -j 'String.format("%s:%s %s:%s", s_0, os_0, s_1, os_1)' < \
      some-table
aa1:null bb1:null
aa2:aa1 bb2:bb1
aa3:aa2 bb3:bb2

Note: See the final example for a demonstration of how the join (-n) option allows for a succinct alternative to the "String.format" used in the previous example.

Use the -v (verbose) and -k (keep temporary files) options to see and inspect the generated Java source for the previous example:

> jcols.jar -jkv 'String.format("%s:%s %s:%s", s_0, os_0, s_1, os_1)' < \
      some-table | head -2
Classpath URL    : file:/tmp/JCols-103762802/
Generated Source : /tmp/JCols-103762802/org/selliott/jcols/LFilter.java

Extract the number that suffixes the second and third columns (the 1, 2, 3 ...) and output the numbers as well as a running total (accumulator). Prefix each output column with a description such as accum_col_2=:

> jcols.jar -r '\S+?(\d+)\s+\S+?(\d+)\s+' '"num_col_2=" + s_0 + " accum_col_2=" + a_0 + "  num_col_3=" + s_1 + " accum_col_3=" + a_1' < some-table
num_col_2=1 accum_col_2=1  num_col_3=1 accum_col_3=1
num_col_2=2 accum_col_2=3  num_col_3=2 accum_col_3=3
num_col_2=3 accum_col_2=6  num_col_3=3 accum_col_3=6

Split each line of some-table on runs of letters and whitespace, leaving only the numbers, and then print out the first two non-blank columns (as with AWK, a -F expression that matches the start of a line results in an initial blank column):

> jcols.jar -jF '[A-Za-z\s]+' '"|" + s_1 + ":" + s_2 + "|"' < some-table 
|1:1|
|2:2|
|3:3|
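
The initial blank column can be reproduced with Java's own String.split, which keeps a leading empty string when the delimiter matches at the start of the input. Whether JCols uses String.split internally is an assumption; the class below is only a hypothetical demonstration of that behaviour.

```java
import java.util.Arrays;

// Hypothetical demonstration, not part of JCols.
final class LeadingColumnSketch {
    static String[] split(String line, String delimiterRegex) {
        return line.split(delimiterRegex);
    }

    public static void main(String[] args) {
        String[] c = split("aa1 bb1 cc1", "[A-Za-z\\s]+");
        // The delimiter matches "aa" at the start, so c[0] is the empty string.
        System.out.println(Arrays.toString(c));              // [, 1, 1, 1]
        System.out.println("|" + c[1] + ":" + c[2] + "|");   // |1:1|
    }
}
```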

Attempt to convert the first two columns to integers. Since the columns of some-table are not valid integers, map failed attempts to parse integers to a default (-d) value of 0. Also ignore (-i) any exceptions by quietly continuing to the next line:

> jcols.jar -jdi 'n_0 + " " + n_1' < some-table
0 0
0 0
0 0
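
The effect of a default for malformed numbers can be sketched as a parse that falls back instead of failing. The helper name and signature are hypothetical; how JCols implements -d internally is not shown in this documentation.

```java
// Hypothetical helper, not part of JCols.
final class DefaultNumberSketch {
    // Return the parsed integer, or the fallback when the text is not a valid integer.
    static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("aa1", 0) + " " + parseOrDefault("bb1", 0)); // 0 0
        System.out.println(parseOrDefault("42", 0));                                   // 42
    }
}
```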

Illustrate calling into the Java (-j) standard library by prefixing each line with the current time in nanoseconds:

> jcols.jar -j 'System.nanoTime() + " " + l' < some-table 
161436268625924 aa1 bb1 cc1
161436268713208 aa2 bb2 cc2
161436268725456 aa3 bb3 cc3

Effectively grep out a subset of the lines by returning the entire line (l) if some criterion is met (here, if the line ends with a "2"). Return null for lines that don't meet the criterion:

> jcols.jar -j 'l.endsWith("2") ? l : null' < some-table 
aa2 bb2 cc2
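
The convention that a null result produces no output amounts to a map followed by a filter. A small sketch of that reading, with hypothetical names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of JCols.
final class FilterSketch {
    // Apply the expression to every line and drop the lines that map to null.
    static List<String> apply(List<String> lines, Function<String, String> expr) {
        return lines.stream()
                .map(expr)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("aa1 bb1 cc1", "aa2 bb2 cc2", "aa3 bb3 cc3");
        System.out.println(apply(table, l -> l.endsWith("2") ? l : null)); // [aa2 bb2 cc2]
    }
}
```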

Take an expression that consists of a comma separated list of strings and join those strings with a join string (-n) equal to the tab character. Also trim off any leading and trailing whitespace (-t) from the line before any additional processing on it:

> jcols.jar -tn \\t 's_0, s_1, s_2' < some-table 
aa1	bb1	cc1
aa2	bb2	cc2
aa3	bb3	cc3
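
In plain Java, the combination of trimming and joining corresponds roughly to the following. The helper is hypothetical and assumes at least three whitespace-separated columns per line.

```java
// Hypothetical sketch, not part of JCols.
final class JoinSketch {
    // Trim the line, split it on whitespace and join the first three columns.
    static String joinColumns(String line, String joinString) {
        String[] c = line.trim().split("\\s+");
        return String.join(joinString, c[0], c[1], c[2]);
    }

    public static void main(String[] args) {
        // Leading and trailing blanks are removed before the columns are split.
        System.out.println(joinColumns("  aa1 bb1 cc1  ", ":")); // aa1:bb1:cc1
    }
}
```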

Programmatic Use

Although it is possible to use JCols programmatically from Java programs, additional work needs to be done to make it easier. Consult org.selliott.jcols.JCols.main() for an example of creating an instance of JCols and then calling the process() method on that instance.

License

This utility is covered by the GNU GPL, version 2 or later. A copy of the license is included in the jcols.jar file. The GPL version 2 license can also be found at http://www.gnu.org/licenses/gpl-2.0.txt

Contact

Feedback may be given by posting comments here. The latest version of this code should always be here.