Elixir/Erlang gets a bad rap when it comes to string handling and the encoding of character data. On top of that, detecting or guessing a document's content encoding based on heuristics alone is a reasonably hard problem to solve.
A recent Elixir/Phoenix project involved fetching a remote HTML page with HTTPoison, modifying parts of it, and returning the changed HTML document. That works perfectly fine when the source document is UTF-8 Unicode, but not so much when the source is ISO-8859-X (Latin-1 and its siblings). This two-parter illustrates two mechanisms for getting hints about the source document's encoding.
Let’s assume that the HTML document contains a <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> element in its document head.
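To have something concrete to poke at in the snippets below, here is a minimal sample document of that shape, bound to a plain Elixir string. It is purely illustrative and not taken from the actual project:

latin1_doc = """
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
  </head>
  <body></body>
</html>
"""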
I’ll cover two cases: first, when the Content-Type response header from the remote webserver is missing (is that even allowed?) or does not correspond to the actual encoding of the response body. And secondly, when the header and the encoding match. Here goes the first part:
Part 1: The proper http-equiv meta element
Guess the content type with this one weird trick
Let’s employ Floki to pull that meta element out of the HTML document’s head:
defmodule Latin1Convert do
  @doc ~S"""
  Retrieves the content type indication from `html`.

      iex> "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\"></head></html>" |> Latin1Convert.meta_http_equiv_encoding
      "text/html; charset=iso-8859-1"

      iex> Latin1Convert.meta_http_equiv_encoding("<html></html>")
      ""
  """
  @spec meta_http_equiv_encoding(String.t) :: String.t
  def meta_http_equiv_encoding(html) do
    html
    |> String.downcase()
    |> Floki.attribute("head > meta[http-equiv=content-type]", "content")
    |> List.first()
    |> to_string()
  end
end
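Two details worth calling out: the input is downcased first so the http-equiv attribute value matches the selector regardless of how the page capitalises Content-Type, and the trailing List.first |> to_string turns the no-meta-tag case (nil) into an empty string, as the second doctest shows.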
It’s easier on my brain to map this to atoms:
defmodule Latin1Convert do
  @doc """
  Looks for a <meta http-equiv="Content-Type"> node in the input
  string's HTML header and returns an atom representing the encoding.
  """
  @spec content_type_from_header(String.t) :: atom | nil
  def content_type_from_header(html) do
    encoding = meta_http_equiv_encoding(html)

    cond do
      Regex.match?(~r(iso-8859)i, encoding) ->
        :latin1
      Regex.match?(~r(utf-8)i, encoding) ->
        :unicode
      true ->
        nil
    end
  end

  def meta_http_equiv_encoding(html) do
    # See above
  end
end
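To make the mapping tangible, here is an illustrative iex session (assuming the module above is compiled) showing the three interesting outcomes:

iex> Latin1Convert.content_type_from_header(~s(<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"></head></html>))
:latin1
iex> Latin1Convert.content_type_from_header(~s(<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head></html>))
:unicode
iex> Latin1Convert.content_type_from_header("<html><head></head></html>")
nil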
Convert and purge (now) erroneous markup
The last step is to convert the HTML input to UTF-8 using the underlying Erlang :unicode module. However, we don’t want the HTML to identify as Latin-1 anymore, so we also have to remove the meta http-equiv tag:
defmodule Latin1Convert do
  @doc """
  Convert an input HTML string to UTF-8 unicode.
  """
  @spec call(String.t) :: String.t
  def call(html) do
    content_type = content_type_from_header(html)

    cond do
      content_type == :latin1 ->
        html
        |> :unicode.characters_to_binary(:latin1)
        |> remove_meta_http_equiv_encoding()
      true ->
        html
    end
  end

  # Caveat: not a truly case-insensitive check for the DOM node.
  # Floki doesn't seem to understand `$=foo i` queries, and we can't
  # `String.downcase` here as that would mess up the returned markup,
  # so matching on `ontent-` only covers the usual capitalisations.
  defp remove_meta_http_equiv_encoding(html) do
    html
    |> Floki.filter_out("head > meta[http-equiv*=ontent-]")
    |> Floki.raw_html()
  end

  def content_type_from_header(html) do
    # see above
  end

  def meta_http_equiv_encoding(html) do
    # See above
  end
end
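To close the loop with the HTTPoison fetch mentioned at the beginning, here is a minimal sketch of how call/1 might be wired in. The PageFetcher module, the fetch_as_utf8 name and the URL are hypothetical glue, not part of the actual project, and error handling is left out:

defmodule PageFetcher do
  # Hypothetical glue: fetch the remote page and normalise its body
  # to UTF-8 before any further processing happens.
  def fetch_as_utf8(url) do
    %HTTPoison.Response{body: body} = HTTPoison.get!(url)

    # For a Latin-1 page, Latin1Convert.call/1 converts the bytes via
    # :unicode.characters_to_binary(body, :latin1) and strips the
    # now-misleading meta tag; UTF-8 pages pass through untouched.
    Latin1Convert.call(body)
  end
end

html = PageFetcher.fetch_as_utf8("http://example.com/legacy-page.html")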
The next part will cover how to use a matching Content-Type HTTP header to short-circuit the guesswork. Check out the file so far: