JCols 0.9.9 by Steven L. Elliott

OVERVIEW

  This program parses text files by applying a user-specified expression to
  each line. The name "JCols" is simply an abbreviation of "Java Columns".
  The line may first be split into columns by whitespace (the default) or by
  some regular expression (the "-r" command line switch). This utility is an
  alternative to tools like AWK.

  This documentation can be found within JCols.java, in the README file and
  at http://selliott.org/utilities/jcols

  Part of the motivation behind this utility was to find the fastest way
  that a user-defined expression could be applied to each line in a file.
  Java was chosen as the host language due to its speed and portability.
  JavaScript was chosen as the embedded language due to its straightforward
  syntax and because it is well supported in Java. In some cases this has
  proven to be 20 times as fast as the equivalent Python program evaluating
  Python expressions on each line.

  This program depends on Rhino for JavaScript support. Directly accessing
  Rhino instead of going through Java 1.6's generic interface was determined
  to be much faster.

  Since JCols depends on Rhino, Rhino must be made available to it. There
  are at least three ways to do that:

  1) On Linux, simply install the "rhino" package so that
     /usr/share/java/rhino.jar exists; that path is included in the class
     path in the manifest of jcols.jar:

       > yum install rhino

     On other operating systems where this path is not correct, edit the
     manifest of jcols.jar, or better yet use one of the remaining options.

  2) Create a directory named jcols-lib in the same directory as jcols.jar
     and place rhino.jar within that directory:

       > ls -1 jcols.jar
       jcols.jar
       > mkdir jcols-lib
       > cp rhino.jar jcols-lib

  3) Include rhino.jar within the "ext" directory of your JRE.
     Although this is not the recommended solution due to possible side
     effects, it was once the way it worked in Fedora, and it still seems
     to work:

       > cp -i rhino.jar /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64\
       /jre/lib/ext

     Note the "-i", which prevents overwriting an existing rhino.jar. Also
     note that the location of the "ext" directory depends on the JRE used.

SYNTAX

  jcols.jar should be used as a pipe (it takes text as standard input and
  writes output to standard output). Ideally the system will support
  executable jar files, in which case the class name can simply be replaced
  with "jcols.jar". If not, the class name should be replaced with
  "java -jar jcols.jar". The usage is:

  Usage: jcols.jar [-f input-file] [-F] [-j] [-k] [-l] [-r reg-expression] \
         [-v] [--] Java[Script]-expression
  Options:
    -f input-file        Read from an input file instead of stdin.
    -F split-expression  Regular expression used as a delimiter to split
                         the line (like IFS)
    -d                   For Java use a default of 0 for malformed numbers.
                         Slower.
    -h                   This help message
    -i                   Ignore input lines that cause exceptions. Slower.
    -j                   Java expression instead of JavaScript
    -k                   Keep temporary files, if any
    -l                   For Java use 'long' instead of 'int'
    -r group-expression  Regular expression with groups for columns used
                         to match the columns
    -v                   Verbose

  The JavaScript-expression is evaluated within a JavaScript function. The
  JavaScript function is called once for each line of input. The function
  has the following parameters:

    n - The line number as an integer.
    l - The line as a string.
    c - The columns as an array of strings.

  Additionally, the following shortcuts for accessing columns are supported.
  The zero-based column index follows the underscore (for example, "s_0" is
  the first column):

    a_ - The running numeric total (accumulator) for the column specified.
    n_ - The numeric value for the column specified.
    s_ - The string value for the column specified.
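
  The per-line model described above can be pictured with a small Java
  sketch. This is only an illustration under stated assumptions, not the
  code in JCols.java: the names LineModelSketch, Row, Expression, run, s,
  num and acc are all hypothetical, every column is accumulated eagerly,
  and non-numeric columns are assumed to leave the running total unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from JCols.java.
class LineModelSketch {
    /** What an expression sees for one line: n, l, c and the shortcuts. */
    static final class Row {
        final int n;                 // line number, starting at 1
        final String l;              // the whole line
        final String[] c;            // the columns
        private final long[] totals; // running totals up to and including this line

        Row(int n, String l, String[] c, long[] totals) {
            this.n = n;
            this.l = l;
            this.c = c;
            this.totals = totals;
        }

        String s(int i) { return c[i]; }                 // plays the role of s_0, s_1, ...
        long num(int i) { return Long.parseLong(c[i]); } // plays the role of n_0, n_1, ...
        long acc(int i) { return totals[i]; }            // plays the role of a_0, a_1, ...
    }

    /** Stand-in for the user-supplied expression. */
    interface Expression {
        String eval(Row row);
    }

    /** Apply expr once per line, splitting each line with splitRegex. */
    static List<String> run(List<String> lines, String splitRegex, Expression expr) {
        List<String> out = new ArrayList<>();
        long[] totals = new long[0];
        int n = 0;
        for (String l : lines) {
            n++;
            String[] c = l.split(splitRegex);
            if (totals.length < c.length) {
                totals = Arrays.copyOf(totals, c.length);
            }
            for (int i = 0; i < c.length; i++) {
                try {
                    totals[i] += Long.parseLong(c[i]);
                } catch (NumberFormatException e) {
                    // Assumption: a non-numeric column leaves its total unchanged.
                }
            }
            // Pass a snapshot so later lines cannot alter this row's totals.
            out.add(expr.eval(new Row(n, l, c, totals.clone())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("10 x", "20 y", "30 z");
        // Roughly the idea behind: n + ":" + s_1 + ":" + a_0
        for (String line : run(lines, "\\s+", row -> row.n + ":" + row.s(1) + ":" + row.acc(0))) {
            System.out.println(line);
        }
        // Prints:
        // 1:x:10
        // 2:y:30
        // 3:z:60
    }
}
```

  The running total already includes the current line when the expression
  is evaluated, which matches the accumulator values shown in the examples
  that follow.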

EXAMPLES

  The following examples use a simple data file:

    > cat some-table
    aa1 bb1 cc1
    aa2 bb2 cc2
    aa3 bb3 cc3

  Print the line number and the length of each line, separated by a tab:

    > jcols.jar 'n + "\t" + l.length' < some-table
    1	11
    2	11
    3	11

  Print only the first and third columns, converted to upper case and
  separated by a ":". Also, specify an input file (-f) rather than reading
  from stdin:

    > jcols.jar -f some-table '(s_0 + ":" + s_2).toUpperCase()'
    AA1:CC1
    AA2:CC2
    AA3:CC3

  Note that the previous example would also have worked with the '-j'
  switch (Java expression instead of JavaScript), since the expression is
  valid for both Java and JavaScript.

  For the first two columns, concatenate the column with its old value.
  Note that "null" is used for the first row, where there is no old value.
  Also note that any column reference may be prefixed with "o" to get the
  old value:

    > jcols.jar -j 'String.format("%s:%s %s:%s", s_0, os_0, s_1, os_1)' < \
      some-table
    aa1:null bb1:null
    aa2:aa1 bb2:bb1
    aa3:aa2 bb3:bb2

  Note: See the final example for a demonstration of how the join (-n)
  option allows for a succinct alternative to the "String.format" used in
  the previous example.

  Use the '-v' (verbose) and '-k' (keep temporary files) switches to see
  and inspect the generated Java source for the previous example:

    > jcols.jar -jkv 'String.format("%s:%s %s:%s", s_0, os_0, s_1, os_1)' < \
      some-table | head -2
    Classpath URL : file:/tmp/JCols-103762802/
    Generated Source : /tmp/JCols-103762802/org/selliott/jcols/LFilter.java

  Extract the number that suffixes the second and third columns (the "1",
  "2", "3" ...) and output the numbers as well as a running total
  (accumulator).
  Prefix each output column with a description such as "accum_col_2=":

    > jcols.jar -r '\S+?(\d+)\s+\S+?(\d+)\s+' \
      '"num_col_2=" + s_0 + " accum_col_2=" + a_0 + " num_col_3=" + s_0 + " accum_col_3=" + a_0' < some-table
    num_col_2=1 accum_col_2=1 num_col_3=1 accum_col_3=1
    num_col_2=2 accum_col_2=3 num_col_3=2 accum_col_3=3
    num_col_2=3 accum_col_2=6 num_col_3=3 accum_col_3=6

  Split 'some-table' on runs of letters and whitespace, leaving only the
  numbers, and then print the first two non-blank columns. As with AWK, a
  "-F" expression that matches the prefix of a line results in an initial
  blank column, so the non-blank columns start at index 1:

    > jcols.jar -jF '[A-Za-z\s]+' '"|" + s_1 + ":" + s_2 + "|"' < some-table
    |1:1|
    |2:2|
    |3:3|

  Attempt to convert the first two columns to integers. Since "some-table"
  is not made up of valid integers, first map failed attempts to parse
  integers to a default (-d) value of 0. Finally, ignore (-i) any exceptions
  by quietly continuing to the next line:

    > jcols.jar -jdi 'n_0 + " " + n_1' < some-table
    0 0
    0 0
    0 0

  Illustrate calling into the Java (-j) standard library by prefixing each
  line with the current time in nanoseconds:

    > jcols.jar -j 'System.nanoTime() + " " + l' < some-table
    161436268625924 aa1 bb1 cc1
    161436268713208 aa2 bb2 cc2
    161436268725456 aa3 bb3 cc3

  Take an expression that consists of a comma-separated list of strings and
  join those strings with a join string (-n) equal to the tab character.
  Also trim any leading and trailing whitespace (-t) from the line before
  any additional processing:

    > jcols.jar -tn \\t 's_0, s_1, s_2' < some-table
    aa1	bb1	cc1
    aa2	bb2	cc2
    aa3	bb3	cc3

LICENSE

  This utility is covered by the GPL version 2 or later. A copy of the
  license should be included in the jar file in which this file was
  included. The GPL version 2 license can also be found at
  http://www.gnu.org/licenses/gpl-2.0.txt

CONTACT

  Send comments and questions to Steven Elliott. The latest version of this
  code may be found on the author's website at http://selliott.org