The Open-Document-Format is an open XML-based standard for spreadsheets, presentations and word processing. In its simplest form it is just a single XML-file in a ZIP-Archive. So there is no need for bloated libraries to create simple documents. All you need is a simple Bash-script and a few common Linux commands.
Background
Some time ago I gathered lots of performance test results in the form of hundreds of separate CSV files. I had to combine all those files into one big spreadsheet for simple data analysis and plotting. Every CSV file should get its own table in the resulting spreadsheet. It is easy to import a CSV file into Open Office, but it is a lot of tedious work to do it for hundreds of files. So I decided to write a script to do it for me.
Content of a minimal ODF file
All ODF files are basically just ZIP files, so you can open them with your favorite archive manager and see what's inside. A typical ODF file contains lots of configuration files, thumbnails and settings, but most of it is unnecessary.
$ unzip -l libreoffice.ods
Archive: libreoffice.ods
Length Date Time Name
--------- ---------- ----- ----
46 2015-08-15 19:34 mimetype
3163 2015-08-15 19:34 Thumbnails/thumbnail.png
7431 2015-08-15 19:34 settings.xml
14268 2015-08-15 19:34 content.xml
844 2015-08-15 19:34 meta.xml
10065 2015-08-15 19:34 styles.xml
0 2015-08-15 19:34 Configurations2/images/Bitmaps/
0 2015-08-15 19:34 Configurations2/popupmenu/
0 2015-08-15 19:34 Configurations2/statusbar/
0 2015-08-15 19:34 Configurations2/progressbar/
0 2015-08-15 19:34 Configurations2/toolbar/
0 2015-08-15 19:34 Configurations2/menubar/
0 2015-08-15 19:34 Configurations2/floater/
0 2015-08-15 19:34 Configurations2/accelerator/current.xml
0 2015-08-15 19:34 Configurations2/toolpanel/
899 2015-08-15 19:34 manifest.rdf
1093 2015-08-15 19:34 META-INF/manifest.xml
--------- -------
37809 17 files
Only two of these files are really needed:
$ unzip -l minimal.ods
Archive: minimal.ods
Length Date Time Name
--------- ---------- ----- ----
14268 2015-08-15 19:34 content.xml
398 2015-09-01 08:36 META-INF/manifest.xml
--------- -------
14666 2 files
- META-INF/manifest.xml
- Contains the mime type and a list of files in the archive.
- content.xml
- The actual data encoded as XML.
All we need to do is create two XML files and zip them. A Bash-script is ideal for this task, but you can easily do it in any other programming language of your choice.
Creating the Manifest
The file META-INF/manifest.xml
contains the mime type of the document and a list with all the files contained in the ZIP archive. The following function takes the mime type and a list of files as parameters and prints the manifest to STDOUT
:
function printManifest() {
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
echo "<manifest:manifest "
echo "xmlns:manifest=\"urn:oasis:names:tc:opendocument:xmlns:manifest:1.0\" "
echo "manifest:version=\"1.2\">"
# Add mime type
echo "<manifest:file-entry manifest:full-path=\"/\" manifest:version=\"1.2\""
echo "manifest:media-type=\"$1\"/>"
shift
# List all the files in the archive
for file in "$@"; do
echo "<manifest:file-entry manifest:full-path=\"$file\""
echo "manifest:media-type=\"text/xml\"/>"
done
echo "</manifest:manifest>"
}
Encoding the Content
Creating the file content.xml
can be divided into the following steps:
Creating the Root-Tag
All namespaces used in the document have to be declared in the root tag of the XML. We only need a few of them for our simple spreadsheet. The function printNamespaces()
can be used to conveniently select a subset from the NAMESPACES
map.
declare -A NAMESPACES=(
[office]="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
[style]="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
[text]="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
[table]="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
[draw]="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
[fo]="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
[xlink]="http://www.w3.org/1999/xlink"
[dc]="http://purl.org/dc/elements/1.1/"
[meta]="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
[number]="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
[presentation]="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0"
[svg]="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
[chart]="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
[dr3d]="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
[math]="http://www.w3.org/1998/Math/MathML"
[form]="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
[script]="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
[ooo]="http://openoffice.org/2004/office"
[ooow]="http://openoffice.org/2004/writer"
[oooc]="http://openoffice.org/2004/calc"
[dom]="http://www.w3.org/2001/xml-events"
[xforms]="http://www.w3.org/2002/xforms"
[xsd]="http://www.w3.org/2001/XMLSchema"
[xsi]="http://www.w3.org/2001/XMLSchema-instance"
[rpt]="http://openoffice.org/2005/report"
[of]="urn:oasis:names:tc:opendocument:xmlns:of:1.2"
[xhtml]="http://www.w3.org/1999/xhtml"
[grddl]="http://www.w3.org/2003/g/data-view"
[tableooo]="http://openoffice.org/2009/table"
[drawooo]="http://openoffice.org/2010/draw"
[calcext]="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0"
[loext]="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0"
[field]="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0"
[formx]="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0"
[css3t]="http://www.w3.org/TR/css3-text/"
)
function printNamespaces() {
for ns in "$@"; do
echo "xmlns:$ns=\"${NAMESPACES[$ns]}\""
done
}
function printHeader() {
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
echo "<office:document-content office:version=\"1.2\" "
printNamespaces office table text
echo "><office:body><office:spreadsheet>"
}
Parsing of the CSV Input and Table Creation
The function printContent()
takes a CSV file as its parameter and prints a XML table to STDOUT
. The CSV file is read in line by line. Every line is split into individual cells and every cell then is enclosed in a <table:table-cell>
tag.
function printRow() {
echo "<table:table-row>"
for cell in $line; do
# Use bash regex to check if it is a number
re='^[0-9.]+$'
if [[ "$cell" =~ $re ]]; then
echo "<table:table-cell office:value-type=\"float\""
echo "office:value=\"$cell\"/>"
else
echo "<table:table-cell office:value-type=\"string\">"
echo "<text:p>$cell</text:p>"
echo "</table:table-cell>"
fi
done
echo "</table:table-row>"
}
function printContent() {
# Create a separate table for every input file
echo "<table:table table:name=\"$(basename $1 .csv)\">"
# Read the input file line by line
while IFS='' read -r line || [[ -n "$line" ]]; do
IFS=';' printRow "$line"
done < "$1"
echo "</table:table>"
}
Closing Open Tags
The following trivial piece of code closes all remaining open tags:
function printFooter() {
echo "</office:spreadsheet></office:body></office:document-content>"
}
Everything put together
The following piece of code uses all the functions defined above to create an ODS file from multiple CSV files. It uses a temporary directory to create the directory structure of the ODS file and then adds everything to a ZIP file.
#!/bin/bash
OUTPUT="$(readlink -f $1)"
shift
TMP=`mktemp -d`
MANIFEST=$TMP/META-INF/manifest.xml
CONTENT=$TMP/content.xml
MIMETYPE="application/vnd.oasis.opendocument.spreadsheet"
mkdir $TMP/META-INF
printManifest $MIMETYPE $(basename $CONTENT) > $MANIFEST
printHeader > $CONTENT
for file in "$@"; do
printContent "$file" >> $CONTENT
done
printFooter >> $CONTENT
rm "$OUTPUT"
cd $TMP
zip "$OUTPUT" *.xml META-INF/*
rm -rf $TMP
The usage of the script is straight forward.
$ sh combine.sh result.osd *.csv
Conclusion
Although it is really nice to know some of the innards of the ODF standard, creating your own files using Bash scripts is kind of pointless. Most of the time the CSV import functionality of Open Office is all you need. If you need more flexibility, it is probably better to use a proper statistical programming language like R.
References
- Wikipedia on Open-Document-Format - https://en.wikipedia.org/wiki/OpenDocument
- R Project - https://www.r-project.org/