Updates as of 2011-06-04
The following documentation is very similar to the documentation that can be found in the README file in the JAR file.
Overview
Skip forward to the download section for instructions on downloading JCols if you'd rather not read this.
This utility parses text files by applying a user-specified expression to each line. The line may first be split into columns by whitespace (the default), by a delimiter regular expression (the -F option), or by regular expression groups (the -r option). The name "JCols" is simply an abbreviation of "Java Columns".
This utility is an alternative to tools such as AWK. It has only a command line interface. If you're interested in an application with a GUI for parsing text files, consider something like FlexText or File Parse, although I have not tried either of those programs. If you want a Java implementation of AWK, look into jawk. Finally, in some cases the cut command, or even bash/sh's set $line to extract fields from a line, may be sufficient.
The distinguishing characteristics of JCols are that it is a small utility that allows arbitrary expressions in relatively familiar languages (Java and JavaScript) to be specified. Also, JCols' -r option, which allows arbitrary regular expression groups to be mapped to columns, does not seem to have an analogue in AWK.
Part of the motivation behind this utility was to find the fastest way that a user-defined expression could be applied to each line in a file. Java was chosen as the host language due to its speed and portability. JavaScript was chosen as the embedded language due to its straightforward syntax and because it is well supported in Java. In some cases this has proven to be 20 times as fast as the equivalent Python utility evaluating Python expressions on each line.
This utility depends on Rhino for JavaScript support. Directly accessing Rhino instead of going through Java 1.6's generic interface was determined to be much faster.
Also, if the -j option is used, the user-specified expression is assumed to be Java instead of JavaScript. Java expressions are compiled with the compiler returned by javax.tools.ToolProvider.getSystemJavaCompiler(). Compiling the Java expression takes roughly one second; after that, JCols is almost as fast as GNU AWK.
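To illustrate the general technique of compiling a user-supplied Java expression at run time, here is a minimal sketch. It is not JCols' actual generated code: the class name LineFilter, the method name apply, and the parameters l and n are assumptions chosen for illustration. It needs a JDK, since getSystemJavaCompiler() returns null on a plain JRE.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class ExpressionCompilerSketch {
    /**
     * Wraps the expression in a hypothetical class, compiles it into a
     * temporary directory, and returns the compiled static method.
     */
    public static Method compile(String expression) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler; a JDK is required");
        }
        Path dir = Files.createTempDirectory("expr-sketch");
        Path source = dir.resolve("LineFilter.java");
        // Assumed wrapper: l is the line, n is the line number.
        String code = "public class LineFilter {\n"
                + "    public static Object apply(String l, int n) {\n"
                + "        return " + expression + ";\n"
                + "    }\n"
                + "}\n";
        Files.write(source, code.getBytes(StandardCharsets.UTF_8));
        // run() returns 0 when compilation succeeds.
        int status = compiler.run(null, null, null, "-d", dir.toString(), source.toString());
        if (status != 0) {
            throw new IllegalArgumentException("Expression failed to compile: " + expression);
        }
        // The loader is left open so the class stays usable by the caller.
        URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() });
        Class<?> cls = loader.loadClass("LineFilter");
        return cls.getMethod("apply", String.class, int.class);
    }
}
```

The returned method can then be invoked once per line, for example compile("l.length()").invoke(null, line, lineNumber). The compilation cost is paid once up front, which is consistent with the one-time delay described above.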
A future version of JCols may attempt to sandbox the expressions by using something like policy files in Java. Until then don't run untrusted expressions.
Requirements
JCols requires Java 6 or later. The JDK, not just the JRE, is required to use Java expressions. Rhino is required to use JavaScript expressions. A future version of JCols may dynamically load Rhino if and only if a JavaScript expression is used. Until then, rhino.jar is always required. See the install section for help with getting rhino.jar to work.
Download
The latest JAR file may be downloaded by clicking here. The source code, as well as the Eclipse project file, is included in the JAR file.
Install
Copy the downloaded JAR file to wherever you like to keep JAR files. Optionally rename the file to simply jcols.jar. Also, optionally make the JAR executable if your system supports executable JAR files. Finally, optionally make sure the executable JAR file is in your path.
Since JCols depends on Rhino, Rhino must be made available to JCols. There are at least three different ways to do this.
> yum install rhino
On other operating systems where this path is not correct, edit the manifest of jcols.jar or, better yet, move on to one of the remaining methods.
> ls -1 jcols.jar
jcols.jar
> mkdir jcols-lib
> cp rhino.jar jcols-lib
> cp -i rhino.jar /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/ext
Note the -i to prevent overwriting an existing rhino.jar. Also note that the location of the ext directory depends on the JRE used.
Syntax
jcols.jar should be used as a pipe (it takes text as standard input and writes text to standard output). Ideally the system supports executable JAR files, in which case jcols.jar can simply be invoked as shown in the examples. If not, jcols.jar should be replaced with java -jar jcols.jar. The usage, which can be seen by running jcols.jar without any arguments, is:
Usage: jcols.jar [options] Java[Script]-expression
Options:
  --                   The remaining arguments are non-option arguments.
  -f input-file        Read from an input file instead of stdin.
  -F split-expression  Regular expression used as a delimiter to split the line (like IFS)
  -d                   For Java use a default of 0 for malformed numbers. Slower.
  -h                   This help message
  -i                   Ignore input lines that cause exceptions. Slower.
  -j                   Java expression instead of JavaScript
  -k                   Keep temporary files, if any
  -l                   For Java use 'long' instead of 'int'
  -n join-string       Join a comma separated list of strings with the specified join-string
  -r group-expression  Regular expression with groups for columns used to match the columns
  -t                   Trim leading and trailing whitespace. Slower.
  -v                   Verbose
The Java[Script]-expression is evaluated either within a JavaScript function for JavaScript or within a method in a generated Java class for Java. The function or method is called once for each line of input. Options may be combined (ex: -jk instead of -j -k) and specified in any order. Options that take an argument, such as -F and -r, require whitespace between the option and the argument. The function or method has the following parameters:
Additionally, the following shortcuts for accessing columns are supported (column numbers start at zero for the leftmost column):
Examples
The following examples use a simple data file:
> cat some-table
aa1 bb1 cc1
aa2 bb2 cc2
aa3 bb3 cc3
Print out the line number and length of each line separated by a tab:
> jcols.jar 'n + "\t" + l.length' < some-table
1	11
2	11
3	11
Print out only the first and third columns with a ":" separating them. Convert the result to upper case. Also, specify an input file (-f) rather than reading from stdin:
> jcols.jar -f some-table '(s_0 + ":" + s_2).toUpperCase()'
AA1:CC1
AA2:CC2
AA3:CC3
Note: The previous example would also have worked with the -j option (Java expression instead of JavaScript), since the expression is valid in both Java and JavaScript.
For the first two columns, concatenate the column with its old value (the value from the previous line). Note that "null" is used for the first row, where there is no old value. Also note that any column reference may be prefixed with "o" to get the old value:
> jcols.jar -j 'String.format("%s:%s %s:%s", s_0, os_0, s_1, os_1)' < \
    some-table
aa1:null bb1:null
aa2:aa1 bb2:bb1
aa3:aa2 bb3:bb2
Note: See the final example for a demonstration of how the join (-n) option allows for a succinct alternative to the "String.format" used in the previous example.
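The "old value" behavior can be approximated in plain Java as follows. This is only a sketch of the idea, assuming whitespace-separated columns; the class and method names are hypothetical and this is not how JCols itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class OldValueSketch {
    /** Pairs the first two columns of each line with the previous line's values. */
    public static List<String> pairWithPrevious(List<String> lines) {
        List<String> out = new ArrayList<>();
        String[] old = null; // there is no previous row before the first line
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            String os0 = (old == null) ? null : old[0];
            String os1 = (old == null) ? null : old[1];
            // String.format renders a null argument as the text "null".
            out.add(String.format("%s:%s %s:%s", cols[0], os0, cols[1], os1));
            old = cols; // the current row becomes the old row for the next line
        }
        return out;
    }
}
```

Applied to the three lines of some-table, this yields the same three output lines as the String.format example above.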
Use the -v (verbose) and -k (keep temporary files) options to locate and inspect the generated Java source for the previous example:
> jcols.jar -jkv 'String.format("%s:%s %s:%s", s_0, os_0, s_1, os_1)' < \
    some-table | head -2
Classpath URL : file:/tmp/JCols-103762802/
Generated Source : /tmp/JCols-103762802/org/selliott/jcols/LFilter.java
Extract the number that suffixes the second and third columns (the 1, 2, 3 ...) and output the numbers as well as a running total (accumulator). Prefix each output column with a description such as accum_col_2=:
> jcols.jar -r '\S+?(\d+)\s+\S+?(\d+)\s+' '"num_col_2=" + s_0 + " accum_col_2=" + a_0 + \
    " num_col_3=" + s_0 + " accum_col_3=" + a_0' < some-table
num_col_2=1 accum_col_2=1 num_col_3=1 accum_col_3=1
num_col_2=2 accum_col_2=3 num_col_3=2 accum_col_3=3
num_col_2=3 accum_col_2=6 num_col_3=3 accum_col_3=6
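For readers more familiar with java.util.regex, mapping regular expression groups to columns and keeping a running total per column might look like the sketch below. It assumes each group captures an integer and that the pattern is applied with find(); whether JCols uses find() or a full match is not stated here, and the class name and output labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupAccumulatorSketch {
    /** Treats each regex group as a numeric column and accumulates per-column totals. */
    public static List<String> run(List<String> lines, String groupRegex) {
        Pattern pattern = Pattern.compile(groupRegex);
        // groupCount() depends only on the pattern, so it can be read before matching.
        long[] accum = new long[pattern.matcher("").groupCount()];
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (!m.find()) {
                continue; // lines that do not match produce no output in this sketch
            }
            StringBuilder sb = new StringBuilder();
            for (int g = 1; g <= m.groupCount(); g++) {
                int col = g - 1; // group 1 becomes column 0
                accum[col] += Long.parseLong(m.group(g));
                if (col > 0) {
                    sb.append(' ');
                }
                sb.append("num_").append(col).append('=').append(m.group(g))
                  .append(" accum_").append(col).append('=').append(accum[col]);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

With the pattern from the example and the three lines of some-table, the running totals come out as 1, 3 and 6 for both groups, matching the accumulator values shown above.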
Split each line of some-table on runs of letters and whitespace, leaving only the numbers, and then print out the first two non-blank columns (as with AWK, a -F expression that matches the start of a line results in an initial blank column):
> jcols.jar -jF '[A-Za-z\s]+' '"|" + s_1 + ":" + s_2 + "|"' < some-table
|1:1|
|2:2|
|3:3|
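The initial blank column is also what Java's own regular expression splitting produces when the delimiter matches at the start of the input (Java 8 and later). The sketch below demonstrates that behavior with java.util.regex.Pattern; it is an illustration of the splitting semantics, not JCols' code, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

public class DelimiterSplitSketch {
    /** Splits a line into columns using a delimiter regular expression. */
    public static String[] columns(String line, String delimiterRegex) {
        // A positive-width match at the start of the input yields a leading
        // empty string; trailing empty strings are dropped.
        return Pattern.compile(delimiterRegex).split(line);
    }
}
```

For the line "aa1 bb1 cc1" and the delimiter [A-Za-z\s]+, the columns are "", "1", "1" and "1", which is why the example above starts at s_1 rather than s_0.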
Attempt to convert the first two columns to integers. Since some-table is not made up of valid integers, first map failed attempts to parse integers to a default (-d) value of 0. Finally, ignore (-i) any exceptions by quietly continuing to the next line:
> jcols.jar -jdi 'n_0 + " " + n_1' < some-table
0 0
0 0
0 0
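The default-on-malformed-number behavior can be pictured as a parse wrapped in a try/catch, which is also a plausible reason the option is marked "Slower". This is a sketch under that assumption, with a hypothetical helper name, not the code JCols generates.

```java
public class DefaultNumberSketch {
    /** Parses a column as an int, returning the fallback when it is malformed. */
    public static int parseOrDefault(String column, int fallback) {
        try {
            return Integer.parseInt(column.trim());
        } catch (NumberFormatException e) {
            // "aa1" and similar non-numeric columns end up here.
            return fallback;
        }
    }
}
```

Here parseOrDefault("aa1", 0) gives 0, while parseOrDefault("42", 0) gives 42.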
Illustrate calling into the Java (-j) standard library by prefixing each line with the current time in nanoseconds:
> jcols.jar -j 'System.nanoTime() + " " + l' < some-table
161436268625924 aa1 bb1 cc1
161436268713208 aa2 bb2 cc2
161436268725456 aa3 bb3 cc3
Effectively grep out a subset of the lines by returning the entire line (l) if some criterion is met (here, if the line ends with a "2"). Return null for lines that don't meet the criterion:
> jcols.jar -j 'l.endsWith("2") ? l : null' < some-table
aa2 bb2 cc2
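The filtering convention above (a null result means "print nothing") can be sketched as a simple read loop. The process method and its signature here are hypothetical and are not the process() method of org.selliott.jcols.JCols; the sketch only shows the idea of applying an expression per line and skipping null results.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.function.Function;

public class NullSkippingLoopSketch {
    /** Applies expr to each input line and writes every non-null result. */
    public static void process(BufferedReader in, PrintWriter out,
                               Function<String, String> expr) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String result = expr.apply(line);
            if (result != null) { // null means the line is filtered out
                out.print(result);
                out.print('\n');
            }
        }
        out.flush();
    }
}
```

Passing the lambda l -> l.endsWith("2") ? l : null over the contents of some-table writes only the line aa2 bb2 cc2.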
Take an expression that consists of a comma-separated list of strings and join those strings with a join string (-n) equal to the tab character. Also trim off any leading and trailing whitespace (-t) from the line before any additional processing on it:
> jcols.jar -tn \\t 's_0, s_1, s_2' < some-table
aa1	bb1	cc1
aa2	bb2	cc2
aa3	bb3	cc3
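In plain Java, the combined effect of trimming the line and joining the listed columns is roughly what String.join provides. The following is a sketch assuming whitespace-separated input with at least three columns; the class and method names are hypothetical.

```java
public class JoinSketch {
    /** Trims the line, splits it on whitespace, and joins the first three columns. */
    public static String joinColumns(String line, String joinString) {
        String[] cols = line.trim().split("\\s+");
        return String.join(joinString, cols[0], cols[1], cols[2]);
    }
}
```

For example, joinColumns("  aa1 bb1 cc1  ", "\t") gives the three columns separated by tab characters, with the surrounding whitespace removed.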
Programmatic Use
Although it is possible to use JCols programmatically from Java programs, additional work is needed to make this easier. Consult org.selliott.jcols.JCols.main() for an example of creating an instance of JCols and then calling the process() method on that instance.
License
This utility is covered by the GPL version 2 or later. A copy of the license is included in the jcols.jar file. The GPL version 2 license can also be found at http://www.gnu.org/licenses/gpl-2.0.txt
Contact
Feedback may be given by posting comments here. The latest version of this code should always be here.