Recently I investigated how encryption (document password protection) works in OpenOffice. My investigation was motivated partly by my own curiosity and partly by a practical need: I wanted to be able to view an encrypted file over a text-based SSH connection without the overhead of bringing up the OpenOffice GUI. The end result is this page explaining what I found and
<a href="/encryption/oodecr">oodecr</a>, a shell script that does the decryption. Keep in mind that I'm not a cryptographer. I just find this stuff interesting.
As you may know, OpenOffice files are ZIP archives with extensions such as ods instead of zip. Unzipping an OpenOffice file reveals that it contains various XML files, the most important of which is content.xml. When OpenOffice files are password protected, the XML files keep the same names, but their contents are seemingly random garbage since they are encrypted.
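To make the archive layout concrete, here is a minimal Python sketch that lists the members of a document using only the standard zipfile module. The function name list_members is my own illustrative choice and is not part of OpenOffice or oodecr.

```python
import zipfile


def list_members(path_or_file):
    """Return (name, stored size, compression method) for each ZIP member."""
    with zipfile.ZipFile(path_or_file) as archive:
        return [
            (info.filename, info.compress_size, info.compress_type)
            for info in archive.infolist()
        ]
```

Calling list_members on the path of an ods file (or on an open binary file object) returns one tuple per member. The compression method is zipfile.ZIP_STORED (0) or zipfile.ZIP_DEFLATED (8).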
What I described above is explained in much greater detail in the OASIS OpenDocument v1.1 standard. The part about encryption is in section 17.3, which I've quoted below. I've emphasized it to distinguish it from my comments, which are mixed in.
The encryption process takes place in the following multiple stages:
1. A 20-byte SHA1 digest of the user entered password is created and passed to the package component.
This seems redundant since PBKDF2 (Password-Based Key Derivation Function) already applies SHA1 many times.
2. The package component initializes a random number generator with the current time.
This is probably sufficient, but it seems that seeding the random number generator with additional unpredictable data, such as the PID or bytes from
/dev/urandom would have been simple relative to the benefit it would provide. The OpenOffice random number generator API is
3. The random number generator is used to generate a random 8-byte initialization vector and 16-byte salt for each file.
The initialization vector and the salt can be found in
4. This salt is used together with the 20-byte SHA1 digest of the password to derive a unique 128-bit key for each file. The algorithm used to derive the key is PBKDF2 using HMAC-SHA-1 (see [RFC2898]) with an iteration count of 1024.
It can be difficult to find a suitable implementation of PBKDF2 that will correctly handle binary passwords (passwords that contain non-ASCII characters). For example, the relevant Java methods such as the following:
    import javax.crypto.*;
    import javax.crypto.spec.*;

    // password is a String and salt is a byte[]; 1024 iterations, 128-bit key
    SecretKeyFactory keyFactory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
    PBEKeySpec pbKeySpec = new PBEKeySpec(password.toCharArray(), salt, 1024, 128);
    SecretKey pbKey = keyFactory.generateSecret(pbKeySpec);
    byte[] encoded = pbKey.getEncoded();
do not work correctly with non-ASCII passwords. OpenOffice's
rtl_digest_PBKDF2() does correctly handle binary passwords. The
pbkdf2 binary included with
oodecr is just a wrapper for
5. The derived key is used together with the initialization vector to encrypt the file using the Blowfish algorithm in cipher-feedback (CFB) mode.
Blowfish seems to be a reasonably well-respected cipher that has not been broken.
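Steps 1, 3 and 4 can be sketched in a few lines of Python, whose hashlib.pbkdf2_hmac takes the password as raw bytes and so has no trouble with a binary digest. This is only an illustration under stated assumptions: the function names are mine, the UTF-8 encoding of the typed password is an assumption, and os.urandom is swapped in for the time-seeded generator the standard describes. The Blowfish step is left out because the Python standard library has no Blowfish implementation.

```python
import hashlib
import os

ITERATIONS = 1024  # iteration count from step 4
KEY_BYTES = 16     # 128-bit Blowfish key


def new_file_parameters():
    """Step 3: a fresh 8-byte IV and 16-byte salt for one file."""
    # os.urandom draws from the OS entropy pool instead of a time-seeded
    # generator; that substitution is my choice, not what the standard says.
    return os.urandom(8), os.urandom(16)


def derive_key(password, salt):
    """Steps 1 and 4: SHA1 the password, then PBKDF2-HMAC-SHA1 the digest."""
    # Assumption: the typed password is encoded as UTF-8 before hashing.
    start_key = hashlib.sha1(password.encode("utf-8")).digest()  # 20 bytes
    # The PBKDF2 "password" here is the binary digest, passed as bytes,
    # so no character-set conversion is involved.
    return hashlib.pbkdf2_hmac("sha1", start_key, salt, ITERATIONS, KEY_BYTES)
```

For decryption, the salt would come from the file's metadata rather than from new_file_parameters, and derive_key(password, salt) would return the 16-byte key to hand to a Blowfish-CFB implementation together with the stored initialization vector.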
Each file that is encrypted is compressed before being encrypted. To allow the contents of the package file to be verified, it is necessary that encrypted files are flagged as 'STORED' rather than 'DEFLATED'. As entries which are 'STORED' must have their size equal to the compressed size, it is necessary to store the uncompressed size in the manifest. The compressed size is stored in both the local file header and central directory record of the Zip file.
The "compressed" above refers to the deflate compression algorithm. The plaintext (the input to the encryption) is only the deflate-compressed data, without any additional headers. Once the data is decrypted, a gzip file (same compression algorithm) can be formed by adding a gzip header and footer. The rest of the quoted text above has to do with the way ZIP files are laid out, which is not what this page is about.
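Both ways of handling the headerless deflate data can be sketched in Python. The first function inflates the raw stream directly; the second frames it as a gzip file following the RFC 1952 layout. The function names are illustrative, and the sketch assumes the caller already knows the CRC-32 and size of the uncompressed data, which the gzip footer requires.

```python
import struct
import zlib


def inflate_raw(deflate_data):
    """Inflate a headerless deflate stream (negative wbits = no zlib header)."""
    return zlib.decompress(deflate_data, -zlib.MAX_WBITS)


def wrap_as_gzip(deflate_data, crc32, uncompressed_size):
    """Frame a raw deflate stream as a gzip file (RFC 1952)."""
    header = (
        b"\x1f\x8b"          # gzip magic number
        b"\x08"              # compression method: deflate
        b"\x00"              # flags: no optional fields
        b"\x00\x00\x00\x00"  # modification time: unknown
        b"\x00"              # extra flags
        b"\xff"              # operating system: unknown
    )
    # The footer is the CRC-32 and length of the uncompressed data,
    # both as little-endian 32-bit values.
    footer = struct.pack("<II", crc32 & 0xFFFFFFFF, uncompressed_size & 0xFFFFFFFF)
    return header + deflate_data + footer
```

If only the XML is wanted, inflate_raw is enough. The gzip wrapper is useful mainly when the result is to be handed to tools such as gunzip or zcat.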
The manifest:checksum-type attribute specifies the name of digest algorithm that can be used to check password correctness. Currently, the only supported digest algorithm is SHA1.
The above, which is in section 17.7.4, seems to be about as close as the standard gets to explaining the SHA1/1K password check. SHA1/1K means that the SHA1 of the first 1024 bytes of the decrypted
content.xml (which is deflate compressed data) is compared to the
META-INF/manifest.xml. If it matches, the password was almost certainly correct. This seems OK to me, but there may be some corner cases where it could leak information about the plaintext document. For example, if the attacker is able to come up with a close guess for the initial part of the plaintext document (perhaps the document mostly consists of a known header, or the attacker has an earlier version), he or she may be able to try variations of the document until the SHA1/1K is matched. Either adding random bytes to the start of the plaintext content.xml or encrypting the SHA1/1K with the same Blowfish algorithm and key would help.
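The SHA1/1K check itself is short enough to sketch in Python. The function names are illustrative, and the sketch assumes the checksum is stored in the manifest as base64 text.

```python
import base64
import hashlib
import hmac


def sha1_1k(decrypted):
    """SHA1 of the first 1024 bytes of the decrypted (still deflated) data."""
    return hashlib.sha1(decrypted[:1024]).digest()


def password_is_correct(decrypted, manifest_checksum_b64):
    """Compare SHA1/1K with the manifest checksum (assumed base64-encoded)."""
    expected = base64.b64decode(manifest_checksum_b64)
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(sha1_1k(decrypted), expected)
```

Because only the first 1024 bytes are hashed, anything after that point has no effect on the check. That is also why a good guess at the start of the document is all an attacker would need to test candidates against the stored checksum.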