The minimum viable XLSX reader

I recently finished a basic XLSX reader for OCaml and I thought it would be a good time to summarize what's necessary to make a library like this, since the documentation is complicated and sometimes hard to find.

Documentation

If you're planning to read XLSX files, you probably want to reference ECMA 376, which describes the basics of the file format (Part 2: Open Packaging Conventions) and the spreadsheet format itself (Part 4: SpreadsheetML). Note that if you are reading this specs, you probably want the 1st edition, since the later editions are just a description of differences since the previous edition.

You might also want to read this overview PDF.

Also, if you're looking for the older XLS format (which is not covered in this blog post), OpenOffice has documentation for it.

Open Packaging Conventions

All Microsoft Office documents are ZIP files containing XML files (this is the "Open Packaging Conventions"). I recommend unzipping an XLSX file while you work so you can reference the XML files directly and check your assumptions. Parsing ZIP files is out scope; I recommend using an off-the-shelf ZIP library, or read the ZIP spec if you're needlessly hardcore.

`[Content_Types].xml`

The entrypoint to one of these files is the top-level [Content_Types].xml, which lists some files ("parts"), their paths, and their content type. For XLSX, you should expect to see a workbook with ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml". There should also be a relationships file with ContentType="application/vnd.openxmlformats-package.relationships+xml". I'm not sure if the spec requires this, but in practice, every XLSX file I've seen puts these at xl/workbook.xml and xl/_rels/workbook.xml.rels.

An example [Content_Types].xml file looks like this:

<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Override PartName="/xl/_rels/workbook.xml.rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Override PartName="/xl/sharedStrings.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml"/>
    <Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
    <Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
    <Override PartName="/xl/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml"/>
    <Override PartName="/_rels/.rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
    <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>

PartName is an absolute path from the top of the ZIP file to the given "part"
ContentType is a MIME type

Relationships

The workbook relationships file (generally xl/_rels/workbook.xml.rels) lists ID's of resources and their paths in the ZIP file. This content is similar to what you find in [Content_Types].xml, but you need it because workbooks reference sheets by relationship ID (which are not the same as sheet ID's).

An XLSX relationship file looks like this:

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
    <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/>
    <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings" Target="sharedStrings.xml"/>
    <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet2.xml"/>
</Relationships>

This is saying that for any file in the workbook, if you see r:id, use the file with that Id in the relationships file. The main thing you need from this file is the relationship ID's for the sheets.

Workbook

Next, you'll open up the workbook file (xl/_rels/workbook.xml). This file has a lot of metadata about the workbook, but most of it is display-related. The only piece we really need is in <sheets>, which lists sheet names and relationship ID's:

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    ...
    <sheets>
        <sheet name="Summary" sheetId="1" state="visible" r:id="rId2"/>
        <sheet name="Data" sheetId="2" state="visible" r:id="rId4"/>
    </sheets>
    ...
</workbook>

This tells us that the first sheet is named "Summary" and has relationship ID "rId2", and the second sheet is named "Data" and has relationship ID "rId4". Note that as far as I can tell, sheetId is not actually useful. You might expect that sheet 1 would be at worksheets/sheet1.xml, but this is not necessarily true. The relationship ID is the only way to map sheets to XML files.

Once we have the relationship ID, we can refer back to the relationship file and find that the "Summary" sheet is at xl/worksheets/sheet1.xml and the "Data" sheet is at xl/worksheets/sheet2.xml.

Sheet data

The sheet[n].xml file has all of the sheet-specific information, including more display-related metadta. If you just want the data, you can ignore everything except <sheetData>, which looks like this:

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    ...
    <sheetData>
        <row r="1">
            <c r="A1" s="0" t="s">
                <v>0</v>
            </c>
            <c r="B1" s="0" t="inlineStr">
                <is>
                    <t>A single inline string</t>
                </is>
            </c>
            <c r="C1" s="2" t="n">
                <v>1234.56789</v>
            </c>
            <c r="D1" s="0" t="s">
                <v>1</v>
            </c>
        </row>
        <row r="3">
            <c r="A3" s="1" t="b">
                <f aca="false">FALSE()</f>
                <v>0</v>
            </c>
            <c r="C3" s="2" t="inlineStr">
                <is>
                    <r>
                        <t xml:space="preserve">A multi-part </t>
                        <t>string!</t>
                    </r>
                </is>
            </c>
        </row>
    </sheetData>
</worksheet>

Each <row> has an index (starting at 1). Note that empty rows can be left out (in the example above, row 2 is empty).

Each cell (<c>) has:

A "cell reference", like r="A1". This means the cell is in column A (the first column) and row 1. Note that cells can also be left out if they're empty (like cell B3 in my example), so you'll probably want to store cells in a hash table, or keep track of missing columns when building an array of cells.
A type (t). The valid types are:
- b - Boolean
- e - Error
- inlineStr - Inline string
- n - Number
- s - Shared string (see "Shared String table" below)
- str - Formula string
Note that n covers integers, floats, dates, times, and sometimes booleans (see "Styles" below). The default is n.
A style index (s). This is the zero-based index into the cell styles array (see "Styles" below). The default value is 0.
A value. This can be a <v> holding a number, or an <is> holding an inline string. In the case of shared strings, the value in the <v> is the zero-based index into the shared string table.

For inline strings, the <is> element contains either a single string in a <t> element, or a list of rich text strings in a <r> element, with the text in <t> elements. The contents of an <is> are the same as the elements of the shared string table.
An optional formula, in an <f> element. Handling these is out of scope for this post, since generally the result of the formula should also be available in a <v> element.

Now we have the sheet data, but to actually access strings, we need to read the "Shared String Table", and to figure out if a number is a date or time, we need to looking at the stylesheets.

Shared string table

To avoid data duplication between sheets (see this article about ZIP file deduplication, Excel stores strings in one big "shared string table". This is an XML document listing an array of strings, which worksheet can reference. This file will be listed in [Content_Types].xml with ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml", and it will also be in the relationships file as Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings".

In every case I've seen, this file was named xl/sharedStrings.xml.

The relevant portions of the file look like:

<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="2" uniqueCount="2">
    <si>
        <t>A single string</t>
    </si>
    <si>
        <r>
            <t xml:space="preserve">Another multi </t>
            <t>part string</t>
        </r>
    </si>
</sst>

The <si> element should be familiar from the previous section, and contains exactly the same content as it does for inline strings. In this example, the string table is the array ["A single string", "Another multi part string"] (assuming you don't care about formatting). Given a cell like <c t="s"><v>0</v></c>, the actual value is "A single string".

As far as I can tell, the shared string table is only valid for strings with type t="s". If a number is repeated several times, it will still be repeated inline in the worksheet.

Number types (number vs date)

Finally, since number types cover dates as well as numbers, you'll almost certainly want to be able to tell them apart. Unfortunately, doing so is remarkably complicated, since the only way to tell is by looking up format strings, figuring out what type they're formatting, and then using that to imply the data type.

To figure this information out, you'll need to find the stylesheet XML file, which has ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml" in the [Content_Types].xml, and Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" in the relationships file.

You will need two pieces of this file:

The custom number formats (<numFmts>)
The cell formatting records (<cellXfs>)

The relevant portions of the stylesheet XML looks like this:

<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <numFmts count="2">
        <numFmt numFmtId="164" formatCode="HH:MM"/>
        <numFmt numFmtId="165" formatCode="0,000.0000"/>
    </numFmts>
    <cellXfs>
        <xf numFmtId="0"/>
        <xf numFmtId="164">
        <xf numFmtId="165">
    </cellXfs>
</styleSheet>

The <cellXfs> is an array that works similarly to the share string table, where the s value is the zero-based index into the array built from it. For example, <c s="1"/> would use the second <xf> in the cellXfs> list.

The <xf> then contains a numFmtId, which can either reference a built-in number format string, or a custom number format string. The custom number formats are listed in <numFmts>, with an arbitrary integer ID (these do not need to be continuous).

The built-in number formats are not all documented, but the SpreadsheetML spec lists around a hundred of them in ECMA 376 Part 2, Section 3.8.30 numFmt (Number Format), and the next section (3.8.31 numFmts (Number Formats)) explains what each element of the number format means. For example, # means a digit that only displays if it's non-zero, 0 means a digit that is also displayed, yy is the two-digit year, etc.

You can see some of the built-in formats here

For the purposes of understanding data types, all you need to know is:

Ignore everything inside of quotes ("") or brackets ([]). These are literal strings and color formatting codes, which aren't relevant to the types.
If the string contains y (year), d (day), w (week), or q (quarter), it's a date. The XLSX spec doesn't mention w or q, but they're reverse-engineered from the output of OpenOffice.
If it contains h (hour), s (second), AM/PM, or A/P then it's a time.
If it contains both, it's a date + time.
- Note: Dates can also contain months and times can contain minutes, but the rules make it hard to tell them apart without a real parser, since m means minutes only if it immediately follows h, and otherwise means months.
If it contains @ it's a string.
In any other case, it's a number (this will generally contain 0, #, ., etc., but these are also valid in the milliseconds part of a time).

Now that you can tell if a number is a date or time, you can get a date by taking the integral portion of a number (the part before the .) and using it as the number of days since December 30th, 1899. The fractional part of the number (after the .) is the time of the day as a floating point number (i.e. 0.25 is 6 AM, 0.75 is 6 PM).

For example, the time I'm typing this is 8:13 PM on Feb 4th, 2018. This would be represented in XLSX as 43135.842361:

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <sheetData>
        <row r="1">
            <c r="A1" s="0" t="n">
                <v>43135.842361</v>
            </c>
        </row>
    </sheetData>
</worksheet>

<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <numFmts count="1">
        <numFmt numFmtId="164" formatCode="yyyy-mm-dd hh:mm"/>
    </numFmts>
    <cellXfs>
        <xf numFmtId="0"/>
    </cellXfs>
</styleSheet>

<!-- A1 is a date that should be formatted as "2018-02-04 20:13" -->

Putting it all together

XLSX format diagram

So to sum up, if you want to read just the data from an XLSX file, and you want to be able to tell numbers and dates apart, you need to:

Read the [Content_Types].xml to find the workbook
Read the xl/_rels/workbook.rels.xml to find the relationship ID's for the sheets, and also to find the shared string table and stylesheet.
Read the shared string table into an array (xl/sharedStrings.xml)
Read the stylesheets (xl/styles.xml), merge the numFmts into a hash table with the built-in number formats, and read the cell format records (<cellStyleXfs>) into an array, then determine types based on the contents of the format strings
Read the sheet data from xl/sheets/sheet[n].xml, lookup shared string data in your shared string array, and lookup type info that you determined from the stylesheet.

Easy, right?