My Notebook

How to implement part of the Open-Document-Speadsheet file format using only Bash

Author
Date
Category
Scripting

The Open-Document-Format is an open XML-based standard for spreadsheets, presentations and word processing. In its simplest form it is just a single XML-file in a ZIP-Archive. So there is no need for bloated libraries to create simple documents. All you need is a simple Bash-script and a few common Linux commands.

Background

Some time ago I gathered lots of performance test results in the form of hundreds of separate CSV files. I had to combine all those files into one big spreadsheet for simple data analysis and plotting. Every CSV file should get its own table in the resulting spreadsheet. It is easy to import a CSV file into Open Office, but it is a lot of tedious work to do it for hundreds of files. So I decided to write a script to do it for me.

Content of a minimal ODF file

All ODF files are basically just ZIP files, so you can open them with your favorite archive manager and see what's inside. A typical ODF file contains lots of configuration files, thumbnails and settings, but most of it is unnecessary.

$ unzip -l libreoffice.ods 
Archive:  libreoffice.ods
  Length      Date    Time    Name
---------  ---------- -----   ----
       46  2015-08-15 19:34   mimetype
     3163  2015-08-15 19:34   Thumbnails/thumbnail.png
     7431  2015-08-15 19:34   settings.xml
    14268  2015-08-15 19:34   content.xml
      844  2015-08-15 19:34   meta.xml
    10065  2015-08-15 19:34   styles.xml
        0  2015-08-15 19:34   Configurations2/images/Bitmaps/
        0  2015-08-15 19:34   Configurations2/popupmenu/
        0  2015-08-15 19:34   Configurations2/statusbar/
        0  2015-08-15 19:34   Configurations2/progressbar/
        0  2015-08-15 19:34   Configurations2/toolbar/
        0  2015-08-15 19:34   Configurations2/menubar/
        0  2015-08-15 19:34   Configurations2/floater/
        0  2015-08-15 19:34   Configurations2/accelerator/current.xml
        0  2015-08-15 19:34   Configurations2/toolpanel/
      899  2015-08-15 19:34   manifest.rdf
     1093  2015-08-15 19:34   META-INF/manifest.xml
---------                     -------
    37809                     17 files

Only two of these files are really needed:

$ unzip -l minimal.ods 
Archive:  minimal.ods
  Length      Date    Time    Name
---------  ---------- -----   ----
    14268  2015-08-15 19:34   content.xml
      398  2015-09-01 08:36   META-INF/manifest.xml
---------                     -------
    14666                     2 files
META-INF/manifest.xml
Contains the mime type and a list of files in the archive.
content.xml
The actual data encoded as XML.

All we need to do is create two XML files and zip them. A Bash-script is ideal for this task, but you can easily do it in any other programming language of your choice.

Creating the Manifest

The file META-INF/manifest.xml contains the mime type of the document and a list with all the files contained in the ZIP archive. The following function takes the mime type and a list of files as parameters and prints the manifest to STDOUT:

function printManifest() {
    echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"

	echo "<manifest:manifest "
	echo "xmlns:manifest=\"urn:oasis:names:tc:opendocument:xmlns:manifest:1.0\" "
	echo "manifest:version=\"1.2\">"

	# Add mime type
	echo "<manifest:file-entry manifest:full-path=\"/\" manifest:version=\"1.2\""
	echo "manifest:media-type=\"$1\"/>"
	shift

	# List all the files in the archive
	for file in "$@"; do
		echo "<manifest:file-entry manifest:full-path=\"$file\""
		echo "manifest:media-type=\"text/xml\"/>"
	done
	echo "</manifest:manifest>"
}

Encoding the Content

Creating the file content.xml can be divided into the following steps:

Creating the Root-Tag

All namespaces used in the document have to be declared in the root tag of the XML. We only need a few of them for our simple spreadsheet. The function printNamespaces() can be used to conveniently select a subset from the NAMESPACES map.

declare -A NAMESPACES=(
    [office]="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
	[style]="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
	[text]="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
	[table]="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
	[draw]="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
	[fo]="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
	[xlink]="http://www.w3.org/1999/xlink"
	[dc]="http://purl.org/dc/elements/1.1/"
	[meta]="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
	[number]="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
	[presentation]="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0"
	[svg]="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
	[chart]="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
	[dr3d]="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
	[math]="http://www.w3.org/1998/Math/MathML"
	[form]="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
	[script]="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
	[ooo]="http://openoffice.org/2004/office"
	[ooow]="http://openoffice.org/2004/writer"
	[oooc]="http://openoffice.org/2004/calc"
	[dom]="http://www.w3.org/2001/xml-events"
	[xforms]="http://www.w3.org/2002/xforms"
	[xsd]="http://www.w3.org/2001/XMLSchema"
	[xsi]="http://www.w3.org/2001/XMLSchema-instance"
	[rpt]="http://openoffice.org/2005/report"
	[of]="urn:oasis:names:tc:opendocument:xmlns:of:1.2"
	[xhtml]="http://www.w3.org/1999/xhtml"
	[grddl]="http://www.w3.org/2003/g/data-view"
	[tableooo]="http://openoffice.org/2009/table"
	[drawooo]="http://openoffice.org/2010/draw"
	[calcext]="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0"
	[loext]="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0"
	[field]="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0"
	[formx]="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0"
	[css3t]="http://www.w3.org/TR/css3-text/"
)

function printNamespaces() {
	for ns in "$@"; do
		echo "xmlns:$ns=\"${NAMESPACES[$ns]}\""
	done
}

function printHeader() {
	echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
	echo "<office:document-content office:version=\"1.2\" "

	printNamespaces office table text

	echo "><office:body><office:spreadsheet>"
}

Parsing of the CSV Input and Table Creation

The function printContent() takes a CSV file as its parameter and prints a XML table to STDOUT. The CSV file is read in line by line. Every line is split into individual cells and every cell then is enclosed in a <table:table-cell> tag.

function printRow() {
    echo "<table:table-row>"

	for cell in $line; do
		# Use bash regex to check if it is a number
		re='^[0-9.]+$'
		if [[ "$cell" =~ $re ]]; then
			echo "<table:table-cell office:value-type=\"float\""
			echo "office:value=\"$cell\"/>"
		else
			echo "<table:table-cell office:value-type=\"string\">"
			echo "<text:p>$cell</text:p>"
			echo "</table:table-cell>"
		fi
	done

	echo "</table:table-row>"
}

function printContent() {
	# Create a separate table for every input file
	echo "<table:table table:name=\"$(basename $1 .csv)\">"

	# Read the input file line by line
	while IFS='' read -r line || [[ -n "$line" ]]; do
		IFS=';' printRow "$line"
	done < "$1"

	echo "</table:table>"
}

Closing Open Tags

The following trivial piece of code closes all remaining open tags:

function printFooter() {
    echo "</office:spreadsheet></office:body></office:document-content>"
}

Everything put together

The following piece of code uses all the functions defined above to create an ODS file from multiple CSV files. It uses a temporary directory to create the directory structure of the ODS file and then adds everything to a ZIP file.

#!/bin/bash

OUTPUT="$(readlink -f $1)"
shift

TMP=`mktemp -d`
MANIFEST=$TMP/META-INF/manifest.xml
CONTENT=$TMP/content.xml
MIMETYPE="application/vnd.oasis.opendocument.spreadsheet"

mkdir $TMP/META-INF

printManifest $MIMETYPE $(basename $CONTENT) > $MANIFEST

printHeader > $CONTENT

for file in "$@"; do
    printContent "$file" >> $CONTENT
done

printFooter >> $CONTENT

rm "$OUTPUT"

cd $TMP
zip "$OUTPUT" *.xml META-INF/*

rm -rf $TMP

The usage of the script is straight forward.

$ sh combine.sh result.osd *.csv

Conclusion

Although it is really nice to know some of the innards of the ODF standard, creating your own files using Bash scripts is kind of pointless. Most of the time the CSV import functionality of Open Office is all you need. If you need more flexibility, it is probably better to use a proper statistical programming language like R.

References