Add Pandoc Metadata With Git Log
Today was an interesting day. I worked on expanding an internal CLI app written in TypeScript. I added a new command to convert Markdown files to HTML, along with an option to watch for file changes. The CLI is built with yargs and zx, while Pandoc takes care of the Markdown-to-HTML conversion. As the documentation contains Mermaid diagrams, I’ve used mermaid-filter to convert these into images. However, that was not the highlight of my day. While working on this little project, I noticed many Pandoc warnings in my terminal:
[WARNING] This document format requires a non-empty <title> element.
Defaulting to '<file name>' as the title.
To specify a title, use 'title' in metadata or in --metadata title="...".
Diving a bit into the documentation, I found some information about Pandoc metadata blocks. These are simple blocks of text providing Pandoc with metadata regarding the document:
% title
% author(s) (separated by semicolons)
% date
A colleague of mine suggested retrieving the author and date from the git logs. This is a good idea, because it’s fully automatable. Manual entry would have been error-prone and extremely boring. To implement this idea, we can create a script that automatically adds git metadata to our files. Here’s a possible implementation:
find . -iname "*.md" -type f -print0 | \
xargs -S1024 -0 -n1 -P0 -I@ /bin/bash -c \
'output=$(cat <(git log --reverse --pretty="%% %aN%n%% %as" "@" | \
head -n 2 | \
awk NF && echo "") "@") \
&& echo -e "% \n$output" > "@"'
find
The initial step in our process involves retrieving all the Markdown files. To accomplish this, we use the find command with several options. Let’s examine each of these options in detail:
Option | Description |
---|---|
. | Uses the current directory as the root of the search |
-iname "*.md" | Matches files with the Markdown extension, ignoring case |
-type f | Matches regular files only (so no directories) |
-print0 | Separates file names with an ASCII NUL character |
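To see what the -iname and -type f options change in practice, here is a small sketch that runs in a throwaway directory. The file names and the variables demo_dir, name_count, and iname_count are made up for illustration:

```shell
# Throwaway directory with illustrative names (assumptions, not from the project).
demo_dir=$(mktemp -d)
touch "$demo_dir/notes.md" "$demo_dir/README.MD" "$demo_dir/image.png"
mkdir "$demo_dir/archive.md"   # a directory that merely looks like a Markdown file

# -name is case-sensitive, -iname is not; -type f filters out the directory.
name_count=$(find "$demo_dir" -name "*.md" -type f | wc -l | tr -d ' ')
iname_count=$(find "$demo_dir" -iname "*.md" -type f | wc -l | tr -d ' ')
echo "name=$name_count iname=$iname_count"

rm -rf "$demo_dir"
```

With -name only notes.md matches, while -iname also picks up README.MD. The archive.md directory is skipped in both cases.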
An important aspect to highlight is the -print0 option. This option makes it safe to process arbitrary file names. By default, xargs uses spaces, tabs, newlines, and end-of-file characters as delimiters. However, since these characters can appear in file names, using the default settings might lead to incorrect processing. To illustrate this potential issue, let’s create a set of example files:
touch line$'\n'delimited.md
touch tab$'\t'delimited.md
touch space$' 'delimited.md
For demonstration purposes, let’s use a simplified version of our command. When we run find . -type f | xargs -n1 echo, the terminal output will display 6 lines. This helps us understand how the xargs command works in relation to the example files above:
./line
delimited.md
./tab
delimited.md
./space
delimited.md
When handling file names, use xargs -0. This option tells xargs to expect input separated by the ASCII NUL character (hex 0), which the operating system does not allow in file names, unlike tabs and newlines, which are permitted. Using xargs -0 prevents unintended consequences, such as accidentally removing the wrong files or misinterpreting input data.
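The difference can be checked by counting invocations rather than reading the output. The following is a minimal sketch that recreates the three awkward file names in a temporary directory; the variables demo_dir, unsafe_count, and safe_count are hypothetical names for this example:

```shell
# Recreate the three example files in a throwaway directory.
demo_dir=$(mktemp -d)
newline='
'
tab=$(printf '\t')
touch "$demo_dir/line${newline}delimited.md" \
      "$demo_dir/tab${tab}delimited.md" \
      "$demo_dir/space delimited.md"

# Whitespace-delimited input: every name is split in two, so six invocations.
unsafe_count=$(cd "$demo_dir" && find . -type f |
  xargs -n1 sh -c 'echo run' sh | wc -l | tr -d ' ')

# NUL-delimited input: one invocation per file, so three.
safe_count=$(cd "$demo_dir" && find . -type f -print0 |
  xargs -0 -n1 sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "unsafe=$unsafe_count safe=$safe_count"
rm -rf "$demo_dir"
```

Each invocation prints the word run once, so the line count equals the number of times xargs started the command.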
xargs
The xargs command is designed to read input, split it into items, and execute a command with those items as arguments:
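Option | Description |
---|---|
-S1024 | Raises the space -I may use for replacements to 1024 bytes (BSD/macOS xargs only; the default is 255) |
-0 | Expects NUL-separated input, matching find -print0 |
-n1 | Takes at most 1 file name for each invocation |
-P0 | Runs as many invocations in parallel as possible |
-I@ | Replaces every @ in the command with the file name |
/bin/bash -c '<script>' | Runs the bash <script> for every invocation |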
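One thing to keep in mind with -I@ is that the file name is pasted into the script text before bash parses it, so a name containing a double quote would break the quoting. A possible alternative, shown here only as a sketch and not as what the original script does, is to hand the file name to bash as the positional parameter "$1". The directory and file names are invented for the example:

```shell
# Throwaway directory; one file name contains double quotes on purpose.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.md" "$demo_dir/with \"quotes\".md"

# The "_" fills $0, so each file name arrives untouched as "$1".
listed=$(cd "$demo_dir" && find . -iname "*.md" -type f -print0 |
  xargs -0 -n1 bash -c 'printf "[%s]\n" "$1"' _ | sort)

printf '%s\n' "$listed"
rm -rf "$demo_dir"
```

Because the name is never substituted into the script source, quotes and other shell metacharacters in it are treated as plain data.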
The script executed by xargs is rather straightforward, but there are a few parts I’d like to highlight:
output=$(cat <(git log --reverse --pretty="%% %aN%n%% %as" "@" | \
head -n 2 | \
awk NF && echo "" \
) "@") && echo -e "% \n$output" > "@"
At first, I considered using process substitution to read the file, add the git metadata, and write everything back to the same file. However, this approach led to an unexpected infinite loop, preventing the process from terminating. An alternative would be to write the intermediate results to a temporary file, but this solution introduces additional overhead and unnecessary complexity:
- Creating a temporary file
- Removing the original file
- Moving the new file to replace the original
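For comparison, the temporary-file route could look roughly like the sketch below. The helper prepend_header and the sample header are hypothetical and only illustrate the steps listed above:

```shell
# prepend_header is a hypothetical helper, not part of the original script.
# Usage: prepend_header "<header text>" <file>
prepend_header() {
  header=$1
  target=$2
  tmp_file=$(mktemp) || return 1   # create a temporary file
  # Write the header, a blank line, and the original content to the temp file.
  { printf '%s\n\n' "$header"; cat "$target"; } > "$tmp_file" || return 1
  # Replace the original. Note: the result inherits mktemp's restrictive permissions.
  mv "$tmp_file" "$target"
}

demo_file=$(mktemp)
printf 'Hello world\n' > "$demo_file"
prepend_header "$(printf '%%\n%% Foo Bar\n%% 1984-01-02')" "$demo_file"
cat "$demo_file"
```

It works, but it needs a helper, a temporary file, and some care around permissions, which is the overhead mentioned above.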
Given the small file sizes, an in-memory approach seemed appropriate. The process begins by reversing the git log output, as our focus is on the initial commit for each file. From this first commit, we extract the author and date. To achieve this, we use the head -n 2 command to retrieve the first two lines of the log. Each commit is formatted using the pattern %% %aN%n%% %as. While this format may appear complex at first glance, it’s relatively straightforward:
Code | Description |
---|---|
%aN | Author name |
%as | Author date in short format (YYYY-MM-DD) |
%n | New line |
%% | Escapes % so we can print % |
Example output for a file with two commits, before head -n 2 is applied, would be:
% Foo Bar
% 1984-01-02
% Baz Qux
% 1969-08-15
I encountered an interesting issue with the git log output. Sometimes it included a line feed at the end, and other times it didn’t. This inconsistency could be worth investigating further in the future to understand the underlying cause. For now, I found a simple workaround: awk NF drops every blank line from the input, which resolves the problem:
echo -e "\n\n\nhello\nworld\n\n\n" | awk NF
hello
world
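The reason this works is that NF holds the number of fields on the current line, and awk prints a line whenever the pattern is non-zero. An empty or whitespace-only line has zero fields and is skipped. A small sketch with header-like input, where the variables cleaned and line_count are only for illustration:

```shell
# Blank lines before, between, and after the two header lines.
cleaned=$(printf '\n\n%% Foo Bar\n\n%% 1984-01-02\n\n\n' | awk NF)

# Count what is left after filtering.
line_count=$(printf '%s\n' "$cleaned" | wc -l | tr -d ' ')

printf '%s\n' "$cleaned"
echo "lines=$line_count"
```

Only the two non-empty lines survive, regardless of how many line feeds surrounded them.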
cat
If you squint a bit, you’ll see this pattern:
cat <(...) "@"
The cat command is combining two inputs. The first input comes from process substitution, denoted by <(...). This technique executes the command within the parentheses and treats its output as if it were a file. The second input is the “@” symbol, which represents the actual file being passed through xargs. Together, these two inputs are concatenated by the cat command.
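Here is the same pattern in isolation, as a sketch with a hard-coded header standing in for the git log output. The temporary file and the sample header lines are assumptions for the demo:

```shell
# A stand-in for one of the Markdown files.
demo_file=$(mktemp)
printf 'Hello world\n' > "$demo_file"

# Process substitution is a bash feature, so the command runs through bash -c,
# as in the original script. Here the file name is passed as "$1".
combined=$(bash -c 'cat <(printf "%s\n" "% Foo Bar" "% 1984-01-02" "") "$1"' _ "$demo_file")

printf '%s\n' "$combined"
rm -f "$demo_file"
```

The result is the two header lines, a blank line, and then the original content of the file.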
Writing the output to file
The final section of the script handles writing the output to a file. We use the -e option, which allows us to include escape sequences like \n for line breaks in the echo command. This lets us add a new line before the main content. Instead of using the file name as the title, I chose to include an empty Pandoc metadata line %. I made this choice because the file name is unlikely to be kept as the final title.
... && echo -e "% \n$output" > "@"
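A caveat worth knowing: echo -e interprets backslash sequences in the whole string, including the Markdown content stored in $output, so a literal \n inside a document would be turned into a line break. A possible alternative, sketched here and not used in the original script, is printf with a %s placeholder. The sample content is invented:

```shell
# Sample content that contains a literal backslash-n, as a Markdown file might.
output='% Foo Bar
% 1984-01-02

Use \n for a line break.'

# %% prints a literal percent sign; %s inserts the content without
# interpreting any backslash sequences it contains.
result=$(printf '%% \n%s\n' "$output")

printf '%s\n' "$result"
```

The first line is still the empty title line, and the backslash sequence in the body is left as written.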
Conclusion
It was satisfying to create a script for a task that would have been tedious and error-prone if done manually. After reading “Efficient Linux at the Command Line,” I’ve realized there’s still much to learn about command-line efficiency. This experience has been both educational and eye-opening.
Posted on Aug 21, 2024 (updated on Aug 27, 2024)