Add Pandoc Metadata With Git Log

Today was an interesting day. I worked on expanding an internal CLI app written in TypeScript. I added a new command to convert Markdown files to HTML, along with an option to watch for file changes. The CLI is built with yargs and zx, while Pandoc takes care of the Markdown-to-HTML conversion. As the documentation contains Mermaid diagrams, I’ve used mermaid-filter to convert these into images. However, that was not the highlight of my day. While working on this little project, I noticed many Pandoc warnings in my terminal:

text

[WARNING] This document format requires a non-empty <title> element.
  Defaulting to '<file name>' as the title.
  To specify a title, use 'title' in metadata or in --metadata title="...".

Diving a bit into the documentation, I found some information about Pandoc metadata blocks. These are simple blocks of text providing Pandoc with metadata regarding the document:

text

% title
% author(s) (separated by semicolons)
% date
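To make the template a bit more concrete, here is a minimal sketch that writes a Markdown file starting with a filled-in title block. The file name example.md and the title, author, and date values are placeholders I made up for this illustration.

```shell
# Scratch location so the sketch does not touch real documentation.
doc="$(mktemp -d)/example.md"

# Write a small Markdown file that starts with a Pandoc title block.
# The title, authors, and date are made-up placeholder values.
cat > "$doc" <<'EOF'
% Release Notes
% Foo Bar; Baz Qux
% 2024-08-21

# Introduction

The document body follows the title block.
EOF

# The first three lines are the title, the author(s), and the date.
head -n 3 "$doc"
```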

A colleague of mine suggested retrieving the author and date from the git logs. This is a good idea, because it’s fully automatable. Manual entry would have been error-prone and extremely boring. To implement this idea, we can create a script that automatically adds git metadata to our files. Here’s a possible implementation:

bash

find . -iname "*.md" -type f -print0 | \
xargs -S1024 -0 -n1 -P0 -I@ /bin/bash -c \
'output=$(cat <(git log --reverse --pretty="%% %aN%n%% %as" "@" | \
  head -n 2 | \
  awk NF && echo "") "@") \
&& echo -e "% \n$output" > "@"'

find

The initial step in our process involves retrieving all the Markdown files. To accomplish this, we use the find command with several options. Let’s examine each of these options in detail:

Option	Description
`.`	Uses the current directory as the root of the search
`-iname "*.md"`	Search for files with the Markdown extension, ignoring case
`-type f`	Search for files only (so no directories)
`-print0`	Separates files with an ASCII NUL character

An important aspect to highlight is the -print0 option, which makes it safe to process arbitrary file names. By default, xargs splits its input on spaces, tabs, and newlines. However, since these characters can appear in file names, using the default settings might lead to incorrect processing. To illustrate this potential issue, let’s create a set of example files:

bash

touch line$'\n'delimited.md
touch tab$'\t'delimited.md
touch space$' 'delimited.md

For demonstration purposes, let’s use a simplified version of our command. When we run find . -type f | xargs -n1 echo in a directory containing only these three files, the terminal output displays 6 lines instead of 3, because xargs splits every name at its whitespace character:

text

./line
delimited.md
./tab
delimited.md
./space
delimited.md

When handling file names, use xargs -0. This option splits the input on the ASCII NUL character (hex 0), which the operating system does not allow in file names, unlike tabs and newlines, which are permitted. Using xargs -0 prevents unintended consequences, such as accidentally removing the wrong files or misinterpreting input data.
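The difference can be checked with a small sketch like the one below. It recreates three awkward file names in a scratch directory and counts what xargs sees with and without NUL delimiters. The variable names are my own, and it assumes the scratch directory path itself contains no whitespace.

```shell
# Scratch directory so the demo does not touch real files.
demo_dir=$(mktemp -d)

# Three awkward names: one with a newline, one with a tab, one with a
# space. printf is used so the snippet does not depend on $'...' quoting.
touch "$demo_dir/$(printf 'line\ndelimited.md')"
touch "$demo_dir/$(printf 'tab\tdelimited.md')"
touch "$demo_dir/space delimited.md"

# Default splitting: every name breaks into two arguments.
unsafe_count=$(find "$demo_dir" -type f | xargs -n1 echo | wc -l | tr -d ' ')

# NUL-delimited: one invocation per file, each printing a single line.
safe_count=$(find "$demo_dir" -type f -print0 | xargs -0 -n1 sh -c 'echo found' sh | wc -l | tr -d ' ')

echo "without -0: $unsafe_count arguments, with -0: $safe_count files"
rm -rf "$demo_dir"
```

The second pipeline prints a fixed word per invocation instead of the file name, because a name containing a newline would otherwise inflate the line count again.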

xargs

The xargs command is designed to read input, split it into items, and execute a command with those items as arguments:

Option	Description
`-n1`	Takes at most 1 file name for each invocation
`-P0`	Runs as many invocations in parallel as possible
`-I@`	Replaces every @ in the command with the file name
`/bin/bash -c '<script>'`	Runs the bash `<script>` for every invocation
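To see the replacement in isolation, here is a minimal sketch that feeds two made-up, NUL-terminated file names to xargs. It deliberately differs from my script in one way: the name is passed to the inner shell as a positional argument ($1) instead of being pasted into the script text. I left out -P0 to keep the output order stable, and -S1024, which as far as I know is specific to the BSD/macOS version of xargs.

```shell
# Two hypothetical file names, each terminated by a NUL character.
# -I@ replaces the @ argument with the current name, so the inner
# shell runs once per name and receives it as $1.
result=$(printf 'notes.md\0my file.md\0' | xargs -0 -I@ sh -c 'echo "processing: $1"' sh @)

echo "$result"
```

Passing the name as an argument is a design choice rather than a requirement. When @ is substituted directly into the script, a file name containing a double quote or a $(...) sequence becomes part of the code that bash executes.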

The script executed by xargs is rather straightforward, but there are a few parts I’d like to highlight:

bash

  output=$(cat <(git log --reverse --pretty="%% %aN%n%% %as" "@" | \
    head -n 2 | \
    awk NF && echo "" \
  ) "@") && echo -e "% \n$output" > "@"

At first, I considered using process substitution to read the file, add the git metadata, and write everything back to the same file. However, this approach led to an unexpected infinite loop, preventing the process from terminating. An alternative would be to write the intermediate results to a temporary file, but this solution introduces additional overhead and unnecessary complexity:

Creating a temporary file
Removing the original file
Moving the new file to replace the original
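For comparison, the temporary-file route could look roughly like the sketch below. The function name prepend_via_tempfile is hypothetical, and the sketch folds the last two steps into one, since mv overwrites the original. Note that mktemp creates the file with restrictive permissions, so the result may not keep the permissions of the original.

```shell
# Prepend a header to a file by going through a temporary file.
# prepend_via_tempfile is a hypothetical helper name for this sketch.
prepend_via_tempfile() {
  file=$1
  header=$2
  # 1. Create the temporary file next to the original, so the final
  #    mv is a rename within the same directory.
  tmp=$(mktemp "$file.XXXXXX") || return 1
  # 2. Write the header followed by the original content.
  { printf '%s\n' "$header"; cat "$file"; } > "$tmp" || { rm -f "$tmp"; return 1; }
  # 3. Replace the original with the new file.
  mv "$tmp" "$file"
}

# Usage (notes.md is a placeholder):
#   prepend_via_tempfile notes.md "% Foo Bar"
```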

Given the small file sizes, an in-memory approach seemed appropriate. The process begins by reversing the git log output, as our focus is on the initial commit for each file. From this first commit, we extract the author and date. To achieve this, we use the head -n 2 command to retrieve the first two lines of the log. Each commit is formatted using the pattern %% %aN%n%% %as. While this format may appear complex at first glance, it’s relatively straightforward:

Code	Description
%aN	Author name
%as	Commit date
%n	New line
%%	Escapes % so we can print %

Example output would be:

text

% Foo Bar
% 1984-01-02
% Baz Qux
% 1969-08-15
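Pulled out of the larger script, the lookup of the first commit can be sketched as a small function. The name first_commit_metadata is hypothetical, and the -- separator before the path is my addition to keep file names from being read as revisions. It assumes a git version that supports the %as placeholder.

```shell
# Print "% <author>" and "% <date>" for the first commit that touched
# a file. first_commit_metadata is a hypothetical helper name.
first_commit_metadata() {
  # --reverse lists the oldest commit first, so head keeps only the
  # two lines that belong to that commit.
  git log --reverse --pretty="%% %aN%n%% %as" -- "$1" | head -n 2
}

# Usage, from inside a git repository (README.md is a placeholder):
#   first_commit_metadata README.md
```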

I encountered an interesting issue with the git log output. Sometimes it included a line feed at the end, and other times it didn’t. This inconsistency could be worth investigating further in the future to understand the underlying cause. For now, I found a simple workaround: awk NF prints only the lines that contain at least one field, which removes all empty lines from the input and resolves the problem:

bash

printf "\n\n\nhello\nworld\n\n\n" | awk NF
hello
world

cat

If you squint a bit, you’ll see this pattern:

bash

cat <(...) "@"

The cat command is combining two inputs. The first input comes from process substitution, denoted by <(...). This technique executes the command within the parentheses and treats its output as if it were a file. The second input is the “@” symbol, which represents the actual file being passed through xargs. Together, these two inputs are concatenated by the cat command.
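The pattern can be tried out without git. In this sketch a printf call stands in for the git log pipeline, and a scratch file stands in for the Markdown document. Both are assumptions for the sake of the example. Process substitution is a bash feature, so the command runs through bash -c, just like the script that xargs executes.

```shell
# Scratch file standing in for a Markdown document.
doc=$(mktemp)
printf 'Body of the document.\n' > "$doc"

# cat reads the output of the substituted command first and the real
# file second. The file path is passed to bash as $1.
combined=$(bash -c 'cat <(printf "%s\n" "% Foo Bar" "% 1984-01-02") "$1"' bash "$doc")

echo "$combined"
rm -f "$doc"
```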

Writing the output to file

The final section of the script handles writing the output to a file. We use the -e option, which allows us to include escape sequences like \n for line breaks in the echo command. This lets us put a line break between the title line and the rest of the content. Instead of using the file name as the title, I chose to include an empty Pandoc metadata line % . I made this decision because the file name is unlikely to be kept as the final title.

bash

... && echo -e "% \n$output" > "@"
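One caveat worth knowing: because $output is expanded inside the string that echo -e processes, backslash sequences in the document itself (a literal \n in a code sample, for example) get interpreted as well. A possible alternative, sketched here with a hypothetical helper name, is to use printf and keep the content out of the format string.

```shell
# Write an empty title line followed by previously captured content.
# write_with_empty_title is a hypothetical helper name.
write_with_empty_title() {
  # The content is passed as an argument to %s, so backslashes in it
  # are written out unchanged.
  printf '%% \n%s\n' "$2" > "$1"
}

# Usage with a scratch file that contains a literal backslash sequence.
doc=$(mktemp)
printf '%s\n' '% Foo Bar' '% 1984-01-02' 'Use \n for a line break.' > "$doc"
output=$(cat "$doc")   # read everything before the file is truncated
write_with_empty_title "$doc" "$output"
```

Capturing the content in a variable first matters here for the same reason as in the original script: the redirection truncates the file before anything is written to it.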

Conclusion

It was satisfying to create a script for a task that would have been tedious and error-prone if done manually. After reading “Efficient Linux at the Command Line,” I’ve realized there’s still much to learn about command-line efficiency. This experience has been both educational and eye-opening.

Posted on Aug 21, 2024 (updated on Aug 27, 2024)

Developer Blog

Anton Lijcklama à Nijeholt
