Programming « Brain Dump

Generating HTML pages from Latex

Posted by cheshirekow in LaTeX on June 29, 2011

While latex is pretty much “not designed” for web content, it is very useful to be able to generate a web version of a latex document. Latex is clearly meant for typesetting layouts on a pre-defined page, but when you want to share the information with others, it’s generally a lot easier for them to go to a webpage than it is to download and open a PDF. A webpage is also generally easier to read than a PDF because the content is continuous, and one can scroll around and click hyperlinks far more fluidly than in a PDF.

Now that MathML and SVG are becoming more supported by web browsers, there is a strong case for sharing mathy documents on the web in addition to paper documents (or PDFs, which are only slightly more readable than paper).

To this end, I’ve been evaluating various Latex to HTML converters on Linux (Ubuntu).

By far my favorite is LaTeXML. It generates crisp, simple pages using MathML and CSS, making it easy to customize the style. It doesn’t support many of the packages that I would generally like to use (like algorithm2e), but then again neither do any of the others. Also, the arXiv project is working on a branch of LaTeXML, so there is promise that it will grow quickly to support a lot of the best packages.

Document Setup

My current approach to generating both PDF and HTML output from latex source is to use a separate top-level document for each. The directory structure looks something like this:

    document
     |- document_html.tex
     |- document_pdf.tex
     |- document.tex
     |- preamble_common.tex
     |- preamble_html.tex
     |- preamble_pdf.tex
     \- references.bib

The two versions of document_[output].tex are the top-level files. They look like this:

%document_html.tex
 
\documentclass[10pt]{article}
\input{preamble_common}
\input{preamble_html} 
\begin{document}
\input{document}
\end{document}

The pdf version is the same but it uses preamble_pdf as an input. Note that in latex you cannot nest \include directives, but you can nest \input directives. Also, \include inserts a page break, so there is no need to use it here. Rather, document.tex may \include its chapters as tex files or the like.
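Since the two top-level files differ only in which preamble they pull in, they could also be generated instead of maintained by hand. Here is a minimal Python 3 sketch of that idea. The helper name `top_level_source` is made up for illustration; it just reproduces the file shown above with the output type substituted, and assumes the file layout from the directory listing.

```python
# Hypothetical helper: build the text of document_<output>.tex.
# Assumes the layout described above: preamble_common.tex,
# preamble_<output>.tex and document.tex all live in the same directory.

TEMPLATE = """%document_{output}.tex

\\documentclass[10pt]{{article}}
\\input{{preamble_common}}
\\input{{preamble_{output}}}
\\begin{{document}}
\\input{{document}}
\\end{{document}}
"""


def top_level_source(output):
    """Return the latex source of the top-level file for 'html' or 'pdf'."""
    if output not in ("html", "pdf"):
        raise ValueError("unknown output type: %r" % output)
    return TEMPLATE.format(output=output)


if __name__ == "__main__":
    # Print the pdf variant; redirect to document_pdf.tex if you want the file.
    print(top_level_source("pdf"))
```

Whether this is worth it for two seven-line files is debatable, but it makes the "same file, different preamble" relationship explicit.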

Makefile

To ease the process of generating the different types, I’m using a makefile.

# The following definitions are the specifics of this project
PDF_OUTPUT  :=  document.pdf
HTML_OUTPUT :=  document.html
 
PDF_MAIN	:=  document_pdf.tex
HTML_MAIN   :=  document_html.tex
 
COMMON_TEX 	:=	document.tex \
                preamble_common.tex
 
PDF_TEX		:=  $(COMMON_TEX) \
                document_pdf.tex \
                preamble_pdf.tex
 
HTML_TEX    :=  $(COMMON_TEX) \
                document_html.tex \
                preamble_html.tex 
 
BIB         :=  references.bib
 
 
 
# these variables are the dependencies for the outputs
PDF_SRC     := $(PDF_TEX) $(BIB)
HTML_SRC    := $(HTML_TEX) $(BIB)
 
# the 'all' target will make both the pdf and html outputs
all: pdf html
 
# the 'pdf' target will make the pdf output
pdf: $(PDF_OUTPUT)
 
# the 'html' target will make the html output
html: $(HTML_OUTPUT)
 
# the pdf output depends on the pdf tex files
# we use a shell script to optionally run pdflatex multiple times until the
# output does not suggest that we rerun latex
$(PDF_OUTPUT): $(PDF_SRC)
	@echo "Running pdflatex on $(PDF_MAIN)"
	@pdflatex $(basename $(PDF_MAIN)) > $(basename $(PDF_MAIN))_0.log
	@echo "Running bibtex"
	@-bibtex   $(basename $(PDF_MAIN)) > bibtex_pdf.log 
	@echo "Checking for rerun suggestion"
	@for ITER in 1 2 3 4; do \
		STABELIZED=`cat $(basename $(PDF_MAIN)).log | grep "Rerun"`; \
		if [ -z "$$STABELIZED" ]; then \
			echo "Document stabilized after $$ITER iterations"; \
			break; \
		fi; \
		echo "Document not stabilized, rerunning pdflatex"; \
		pdflatex $(basename $(PDF_MAIN)) > $(basename $(PDF_MAIN))_$$ITER.log; \
	done
	@echo "Copying pdf to target file"
	@cp $(basename $(PDF_MAIN)).pdf $(PDF_OUTPUT)
 
# the html output depends on the html tex files
# we have to process all of the bibliography files separately into xml files, 
# and then include them all in the call to the postprocessor
$(HTML_OUTPUT): $(HTML_SRC)
	@echo "Running latexml on $(HTML_MAIN)"
	@latexml $(HTML_MAIN) --dest=$(basename $(HTML_OUTPUT)).xml > $(basename $(HTML_MAIN)).log 2>&1
	@BIBSTRING=""; \
	for BIBFILE in $(BIB); do \
		echo "Running latexml on $$BIBFILE"; \
		XMLFILE=`basename "$$BIBFILE" .bib`.xml; \
		LOGFILE=`basename "$$BIBFILE" .bib`_html.log; \
	    latexml $$BIBFILE --dest=$$XMLFILE > $$LOGFILE 2>&1; \
	    BIBSTRING="$$BIBSTRING --bibliography=$$XMLFILE"; \
	done; \
	echo $$BIBSTRING > bibstring.txt
	@echo "postprocessing with `cat bibstring.txt`"
	@latexmlpost $(basename $(HTML_OUTPUT)).xml `cat bibstring.txt` --dest=$(HTML_OUTPUT) --css=navbar-left.css
 
# the 2>/dev/null redirects stderr to the null device so that we don't get error
# messages in the console when rm has nothing to remove
clean:
	@-rm -v *.log 2>/dev/null
	@-rm -v *.out 2>/dev/null
	@-rm -v *.aux 2>/dev/null
	@-rm -v *.xml 2>/dev/null
	@-rm -v *.pdf 2>/dev/null
	@-rm -v *.html 2>/dev/null
	@-rm -v bibstring.txt 2>/dev/null


Some notes on the makefile. I execute bibtex ignoring errors (the dash before ‘bibtex’) because bibtex will exit with an error if it doesn’t find any citations, or if there is no bibliography. The output of each pdflatex run is redirected to a logfile named “document_pdf_<i>.log”, where “<i>” is the iteration number. The output of pdflatex and bibtex is suppressed by dumping it to a logfile (I find the verbosity useless to have in the console).

The shell script in the PDF recipe iterates up to four times. The first thing it does is grep the log of the most recent pdflatex run, looking for a line where latex recommends that we “Rerun” latex. If it finds such a line, it sets the shell variable STABELIZED to that string; otherwise the variable gets the empty string. Then we test whether the string is empty. If it’s empty, we’re done, so we break out of the loop. If it’s not, we rerun pdflatex.
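For anyone who finds the control flow easier to read outside of make-escaped shell, here is a rough Python 3 sketch of the same loop. It is not part of the makefile. The names `needs_rerun`, `run_until_stable`, `run_latex` and `read_log` are invented for this example, and the two callables are passed in so the loop can be tried without a TeX installation.

```python
# Sketch of the rerun loop from the PDF recipe.
# run_latex and read_log are injected callables (hypothetical names):
#   run_latex() should run pdflatex once,
#   read_log()  should return the text of the most recent latex log.


def needs_rerun(log_text):
    """True if the log contains the word 'Rerun', like grep "Rerun"."""
    return "Rerun" in log_text


def run_until_stable(run_latex, read_log, max_reruns=4):
    """Rerun latex until the log stops asking for it.

    Returns the number of reruns performed (at most max_reruns).
    """
    reruns = 0
    for _ in range(max_reruns):
        if not needs_rerun(read_log()):
            break  # nothing left to resolve, the document is stable
        run_latex()
        reruns += 1
    return reruns
```

With real tools, `run_latex` could wrap `subprocess.run(["pdflatex", "document_pdf"])` and `read_log` could read `document_pdf.log`. Like the makefile version, this gives up silently after four reruns even if latex still asks for another one.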

The shell script in the HTML recipe iterates over each of the (potentially multiple, potentially zero) bibliography files, processing each of them with latexml. It then appends the string “--bibliography=<filename>.xml” to the BIBSTRING shell variable. The last thing it does is echo the contents of that shell variable to the file “bibstring.txt”. This is so that subsequent commands run by make can find it, since make runs each recipe line in its own shell.
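The way the postprocessor command line gets assembled can be sketched in Python 3 as follows. This only builds the argument list and runs nothing. The function names are made up for illustration, and the only options used are the ones that already appear in the makefile (`--bibliography`, `--dest`, `--css`).

```python
import os

# Sketch of how the HTML recipe assembles the latexmlpost call.
# All function names here are hypothetical.


def xml_name(bib_file):
    """Map references.bib to references.xml.

    The makefile uses `basename file .bib`, which strips only a .bib suffix;
    splitext strips whatever extension is there, which is the same thing
    for .bib inputs.
    """
    stem, _ = os.path.splitext(os.path.basename(bib_file))
    return stem + ".xml"


def bibliography_args(bib_files):
    """One --bibliography option per bibliography file (possibly none)."""
    return ["--bibliography=" + xml_name(bib) for bib in bib_files]


def latexmlpost_command(xml_input, html_output, bib_files,
                        css="navbar-left.css"):
    """Build the argument list for the postprocessing step."""
    return (["latexmlpost", xml_input]
            + bibliography_args(bib_files)
            + ["--dest=" + html_output, "--css=" + css])
```

Building a list instead of one long string sidesteps the bibstring.txt detour, because nothing has to survive from one shell invocation to the next.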


ImgClip (xclip for images)

Posted by cheshirekow in Python on June 7, 2011

Here is a little python script I wrote to emulate xclip for image files. xclip, if you don’t know, is a simple command line tool for setting/retrieving text from the clipboard. For instance the following command

ls -l | xclip -i -selection clipboard

copies the current directory listing to the GNOME clipboard, where it can then be pasted with Ctrl+V into a forum post, email, etc.
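The same thing can be done from a script. Below is a small Python 3 sketch that pipes a string into xclip using the same options as the command above. The function names `xclip_command` and `copy_text` are made up for this example, and it assumes xclip is installed and on the PATH.

```python
import subprocess

# Sketch: copy text to the clipboard by piping it into xclip.
# Function names are hypothetical; assumes xclip is installed.


def xclip_command(selection="clipboard"):
    """Argument list equivalent to: xclip -i -selection clipboard"""
    return ["xclip", "-i", "-selection", selection]


def copy_text(text, selection="clipboard"):
    """Send text to xclip on stdin; raises if xclip exits with an error."""
    subprocess.run(xclip_command(selection),
                   input=text.encode("utf-8"),
                   check=True)
```

Calling `copy_text("hello")` should leave “hello” ready to paste, just like the shell pipeline.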

I really wanted something that does the same for image files. Unfortunately the following does not work:

cat image.png | xclip -i -selection clipboard

I’m not sure of the details of how the gnome clipboard works… but this doesn’t do it. I discovered a way to do it easily using pygtk. Here is a python script that does exactly what I want:

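#! /usr/bin/python
# Copy an image file to the clipboard (Python 2 with PyGTK 2).

import os
import sys

import pygtk
pygtk.require('2.0')
import gtk


def copy_image(f):
    """Load the image at path f and place it on the clipboard."""
    assert os.path.exists(f), "file does not exist"
    image = gtk.gdk.pixbuf_new_from_file(f)

    clipboard = gtk.clipboard_get()
    clipboard.set_image(image)
    # ask for the data to be stored so it survives after this script exits
    clipboard.store()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: imgclip.py /path/to/image")
    copy_image(sys.argv[1])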

P.S. I pasted this code into this post using the following command

cat imgclip.py | xclip -i -selection clipboard

Make sure to make the script executable

chmod +x imgclip.py

And then use it like this

./imgclip.py /path/to/some/image.png

