Shaving a yak for Excel

As our business grows, our needs to analyse and understand it grow even faster. This means we must get better at generating reports so that more people can use them with less effort. One tool for improving long reports is to break them up in sections. In Excel spreadsheets, this can be done with the outline feature. This post looks at how we made the axlsx gem show outlines the way we wanted.

This is a simple example of a table with 3 rows of detailed data and their sums at the end:

1. sums below expanded

We can press the minus-in-a-box to collapse the detail:

2. sums below collapsed

With 10 sections like this in a report, the reader has the option to look at all the details or only an overview—with all the data fitting on one screen.  The example was created using this snippet which should be fairly straight forward:

xls(name: "sums_below.xlsx") do |sheet|
  sheet.add_row ["Lorem ipsum", 123, 234, 345]
  sheet.add_row ["Lorem ipsum", 321, 432, 543]
  sheet.add_row ["Lorem ipsum", 345, 456, 567]
  sheet.add_row [
    "Sums:", "=SUM(B1:B3)", "=SUM(C1:C3)", "=SUM(D1:D3)",
  ]
  # Set rows R1 to R3 (zero-based: 0 to 2) to
  # outline level 1 and do not hide them.
  sheet.outline_level_rows(0, 2, 1, false)
end

Please ignore the xls helper function for the moment, but note that it uses the Axlsx gem which is a great tool.

Now, for various reasons I want the sums above the data instead of below, so I change the code:

xls(name: "sums_above.xlsx") do |sheet|
  sheet.add_row [
    "Sums:", "=SUM(B2:B4)", "=SUM(C2:C4)", "=SUM(D2:D4)",
  ]
  sheet.add_row ["Lorem ipsum", 123, 234, 345]
  sheet.add_row ["Lorem ipsum", 321, 432, 543]
  sheet.add_row ["Lorem ipsum", 345, 456, 567]
  sheet.outline_level_rows(1, 3, 1, false)
end

The sums are on top all right, but the collapse/expand button stays at the bottom, which is disappointing:

3. sums above but button below

(If you think that is obvious, consider that if you create this sheet manually in Excel, the button will end up next to the sums automatically).  So how do we make it go to the top?  The documentation (and the vast collection of examples) have nothing to say on the subject, but they don’t go into a lot of detail about anything, so it doesn’t surprise me.  On the other hand, past experience shows that the gem is better than the documentation, so I’m confident that we can find a way to do it.  But how?

Use the Source, Luke!

Luckily, Microsoft replaced the binary file formats for the Office suite with standardised zip archives full of XML files years ago, so we can inspect the raw sheet to learn its secrets.  Here’s what I did:

  1. Save “sums_above.xlsx” as “plus_below.xlsx” (using File > Save As). This gives us an official baseline.  (This step is important!  Axlsx creates great spreadsheets, but the source doesn’t look exactly like the one Excel creates.)
  2. Clear all outlines (Data > Group and Outline > Clear Outline).
  3. Turn on automatic outlines (Data > Group and Outline > Auto Outline).  This creates the look we want: 6. auto outline
  4. Save the file as “plus_above.xlsx”.
  5. Unzip the two xlsx files: unzip -d below plus_below.xlsx && unzip -d above plus_above.xlsx

The hypothesis is that the only difference between the two sets of XML files will be whatever makes Excel put the collapse button before the rows that should be hidden. But how do we find the differences?  Running diff on the two sets (diff -u -r below above) should do the trick, but the output is not very helpful.  It finds differences in two files, not one, and the files are basically one long line and diff does not say where the differences are:

$ diff -u -r below above
diff -u -r below/docProps/core.xml above/docProps/core.xml
--- below/docProps/core.xml	1980-01-01 00:00:00.000000000 +0100
+++ above/docProps/core.xml	1980-01-01 00:00:00.000000000 +0100
@@ -1,2 +1,2 @@
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
-<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><dc:creator>axlsx</dc:creator><cp:lastModifiedBy>Microsoft Office User</cp:lastModifiedBy><cp:revision>0</cp:revision><dcterms:created xsi:type="dcterms:W3CDTF">2018-03-02T18:07:18Z</dcterms:created><dcterms:modified xsi:type="dcterms:W3CDTF">2018-03-02T17:07:50Z</dcterms:modified></cp:coreProperties>
\ No newline at end of file
+<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><dc:creator>axlsx</dc:creator><cp:lastModifiedBy>Microsoft Office User</cp:lastModifiedBy><cp:revision>0</cp:revision><dcterms:created xsi:type="dcterms:W3CDTF">2018-03-02T18:07:18Z</dcterms:created><dcterms:modified xsi:type="dcterms:W3CDTF">2018-03-02T17:08:08Z</dcterms:modified></cp:coreProperties>
\ No newline at end of file
diff -u -r below/xl/worksheets/sheet1.xml above/xl/worksheets/sheet1.xml
--- below/xl/worksheets/sheet1.xml	1980-01-01 00:00:00.000000000 +0100
+++ above/xl/worksheets/sheet1.xml	1980-01-01 00:00:00.000000000 +0100
@@ -1,2 +1,2 @@
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
-<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"><dimension ref="A1:D4" /><sheetViews><sheetView tabSelected="1" showWhiteSpace="0" workbookViewId="0" /></sheetViews><sheetFormatPr baseColWidth="10" defaultColWidth="8.83203125" defaultRowHeight="14" outlineLevelRow="1" x14ac:dyDescent="0.15" /><cols><col min="1" max="1" width="11" bestFit="1" customWidth="1" /><col min="2" max="4" width="6.6640625" bestFit="1" customWidth="1" /></cols><sheetData><row r="1" spans="1:4" x14ac:dyDescent="0.15"><c r="A1" t="s"><v>0</v></c><c r="B1"><f>SUM(B2:B4)</f><v>789</v></c><c r="C1"><f>SUM(C2:C4)</f><v>1122</v></c><c r="D1"><f>SUM(D2:D4)</f><v>1455</v></c></row><row r="2" spans="1:4" outlineLevel="1" x14ac:dyDescent="0.15"><c r="A2" t="s"><v>1</v></c><c r="B2"><v>123</v></c><c r="C2"><v>234</v></c><c r="D2"><v>345</v></c></row><row r="3" spans="1:4" outlineLevel="1" x14ac:dyDescent="0.15"><c r="A3" t="s"><v>1</v></c><c r="B3"><v>321</v></c><c r="C3"><v>432</v></c><c r="D3"><v>543</v></c></row><row r="4" spans="1:4" outlineLevel="1" x14ac:dyDescent="0.15"><c r="A4" t="s"><v>1</v></c><c r="B4"><v>345</v></c><c r="C4"><v>456</v></c><c r="D4"><v>567</v></c></row></sheetData><pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5" /><pageSetup orientation="portrait" /></worksheet>
\ No newline at end of file
+<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"><sheetPr enableFormatConditionsCalculation="0"><outlinePr summaryBelow="0" /></sheetPr><dimension ref="A1:D4" /><sheetViews><sheetView tabSelected="1" showWhiteSpace="0" workbookViewId="0" /></sheetViews><sheetFormatPr baseColWidth="10" defaultColWidth="8.83203125" defaultRowHeight="14" outlineLevelRow="1" x14ac:dyDescent="0.15" /><cols><col min="1" max="1" width="11" bestFit="1" customWidth="1" /><col min="2" max="4" width="6.6640625" bestFit="1" customWidth="1" /></cols><sheetData><row r="1" spans="1:4" x14ac:dyDescent="0.15"><c r="A1" t="s"><v>0</v></c><c r="B1"><f>SUM(B2:B4)</f><v>789</v></c><c r="C1"><f>SUM(C2:C4)</f><v>1122</v></c><c r="D1"><f>SUM(D2:D4)</f><v>1455</v></c></row><row r="2" spans="1:4" outlineLevel="1" x14ac:dyDescent="0.15"><c r="A2" t="s"><v>1</v></c><c r="B2"><v>123</v></c><c r="C2"><v>234</v></c><c r="D2"><v>345</v></c></row><row r="3" spans="1:4" outlineLevel="1" x14ac:dyDescent="0.15"><c r="A3" t="s"><v>1</v></c><c r="B3"><v>321</v></c><c r="C3"><v>432</v></c><c r="D3"><v>543</v></c></row><row r="4" spans="1:4" outlineLevel="1" x14ac:dyDescent="0.15"><c r="A4" t="s"><v>1</v></c><c r="B4"><v>345</v></c><c r="C4"><v>456</v></c><c r="D4"><v>567</v></c></row></sheetData><pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5" /><pageSetup orientation="portrait" /></worksheet>
\ No newline at end of file

Let’s use xmllint to format the XML files in a way that’s more readable for humans1:

tmpfile=$(mktemp -t xml)
for f in $(find above below -name "*.xml"); do
    cp $f $tmpfile
    xmllint --format -o $f $tmpfile
done

(I haven’t found a way to make xmllint change files in-place, hence the use of a temp file.)  Now the diff output becomes more palatable:

$ diff -u -r below above
diff -u -r below/docProps/core.xml above/docProps/core.xml
--- below/docProps/core.xml	2018-03-01 15:29:39.000000000 +0100
+++ above/docProps/core.xml	2018-03-01 15:29:39.000000000 +0100
@@ -4,5 +4,5 @@
   <cp:lastModifiedBy>Microsoft Office User</cp:lastModifiedBy>
   <cp:revision>0</cp:revision>
   <dcterms:created xsi:type="dcterms:W3CDTF">2018-03-01T13:17:46Z</dcterms:created>
-  <dcterms:modified xsi:type="dcterms:W3CDTF">2018-03-01T12:18:55Z</dcterms:modified>
+  <dcterms:modified xsi:type="dcterms:W3CDTF">2018-03-01T12:19:24Z</dcterms:modified>
 </cp:coreProperties>
diff -u -r below/xl/worksheets/sheet1.xml above/xl/worksheets/sheet1.xml
--- below/xl/worksheets/sheet1.xml	2018-03-01 15:29:39.000000000 +0100
+++ above/xl/worksheets/sheet1.xml	2018-03-01 15:29:39.000000000 +0100
@@ -1,5 +1,8 @@
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" …>
+  <sheetPr enableFormatConditionsCalculation="0">
+    <outlinePr summaryBelow="0"/>
+  </sheetPr>
   <dimension ref="A1:D4"/>
   <sheetViews>
     <sheetView tabSelected="1" showWhiteSpace="0" workbookViewId="0"/>

The first chunk is a timestamp, so we can ignore that. The second has what we need: the good file has a sheetPr section with a outlinePr element with a summaryBelow attribute—that has to be it!

Now let’s find out how to make Axlsx set that attribute the way we want. The gem is listed in my Gemfile so I open it using VISUAL=subl bundle open axlsx (I even have an alias, bopen='VISUAL=subl bundle open') and start looking. Searching for summaryBelow yields two files that belong to the test suite but nothing more. But having used axlsx for a while, I know that it has never made me use any CamelCase identifiers (it’s a Ruby library through and through) and it’s a small leap to instead search for summary_below. This points to an OutlinePr class, which (as a search reveals) is only ever used by the SheetPr class, which mimics the structure we saw in the XML file. SheetPr in its turn is only used by the Worksheet class, and now we have enough information to string this together:

xls(name: "sums_above.xlsx") do |sheet|
  sheet.add_row [
    "Sums:", "=SUM(B2:B4)", "=SUM(C2:C4)", "=SUM(D2:D4)",
  ]
  sheet.add_row ["Lorem ipsum", 123, 234, 345]
  sheet.add_row ["Lorem ipsum", 321, 432, 543]
  sheet.add_row ["Lorem ipsum", 345, 456, 567]
  sheet.outline_level_rows(1, 3, 1, false)
  sheet.sheet_pr.outline_pr.summary_below = false
end

We run it, and it works!

Footnotes

  1. If you find the use of xmllint overkill or just don’t have it installed, I recommend perl -i -pe 's/</\n</g;' $(find above below -name "*.xml") instead.
Share this:

Leave a Reply

Your email address will not be published. Required fields are marked *