Decode HTML Entities in Emacs / Elisp

Some websites like to encode all of their text as HTML entities, so instead of seeing the text as

    So I'm looking

you get something like:

    &#83;&#111;&#32;&#73;&#39;&#109;&#32;&#108;&#111;&#111;&#107;&#105;&#110;&#103;

I was wondering whether there is a built-in way to turn the encoded text into plain text using Emacs, or whether I have to declare my own map ("&#83;" => "S", ...) and decode the text manually with it.

Any pointers would be greatly appreciated.

3 answers

I wrote this function to handle named (non-numeric) entities, in case anyone needs it.

    (defun html-entities-to-unicode (string)
      "Replace named HTML entities in STRING with their Unicode characters.
    Unknown entities are left untouched."
      (let* ((plist '(Aacute "Á" aacute "á" Acirc "Â" acirc "â" acute "´" AElig "Æ" aelig "æ"
                      Agrave "À" agrave "à" alefsym "ℵ" Alpha "Α" alpha "α" amp "&" and "∧"
                      ang "∠" apos "'" aring "å" Aring "Å" asymp "≈" atilde "ã" Atilde "Ã"
                      auml "ä" Auml "Ä" bdquo "„" Beta "Β" beta "β" brvbar "¦" bull "•"
                      cap "∩" ccedil "ç" Ccedil "Ç" cedil "¸" cent "¢" Chi "Χ" chi "χ"
                      circ "ˆ" clubs "♣" cong "≅" copy "©" crarr "↵" cup "∪" curren "¤"
                      Dagger "‡" dagger "†" darr "↓" dArr "⇓" deg "°" Delta "Δ" delta "δ"
                      diams "♦" divide "÷" eacute "é" Eacute "É" ecirc "ê" Ecirc "Ê"
                      egrave "è" Egrave "È" empty "∅" emsp "\u2003" ensp "\u2002"
                      Epsilon "Ε" epsilon "ε" equiv "≡" Eta "Η" eta "η" eth "ð" ETH "Ð"
                      euml "ë" Euml "Ë" euro "€" exist "∃" fnof "ƒ" forall "∀"
                      frac12 "½" frac14 "¼" frac34 "¾" frasl "⁄" Gamma "Γ" gamma "γ"
                      ge "≥" gt ">" harr "↔" hArr "⇔" hearts "♥" hellip "…"
                      iacute "í" Iacute "Í" icirc "î" Icirc "Î" iexcl "¡" igrave "ì" Igrave "Ì"
                      image "ℑ" infin "∞" int "∫" Iota "Ι" iota "ι" iquest "¿" isin "∈"
                      iuml "ï" Iuml "Ï" Kappa "Κ" kappa "κ" Lambda "Λ" lambda "λ"
                      lang "〈" laquo "«" larr "←" lArr "⇐" lceil "⌈" ldquo "“" le "≤"
                      lfloor "⌊" lowast "∗" loz "◊" lrm "\u200E" lsaquo "‹" lsquo "‘" lt "<"
                      macr "¯" mdash "—" micro "µ" middot "·" minus "−" Mu "Μ" mu "μ"
                      nabla "∇" nbsp "\u00A0" ndash "–" ne "≠" ni "∋" not "¬" notin "∉"
                      nsub "⊄" ntilde "ñ" Ntilde "Ñ" Nu "Ν" nu "ν" oacute "ó" Oacute "Ó"
                      ocirc "ô" Ocirc "Ô" OElig "Œ" oelig "œ" ograve "ò" Ograve "Ò"
                      oline "‾" omega "ω" Omega "Ω" Omicron "Ο" omicron "ο" oplus "⊕"
                      or "∨" ordf "ª" ordm "º" oslash "ø" Oslash "Ø" otilde "õ" Otilde "Õ"
                      otimes "⊗" ouml "ö" Ouml "Ö" para "¶" part "∂" permil "‰" perp "⊥"
                      Phi "Φ" phi "φ" Pi "Π" pi "π" piv "ϖ" plusmn "±" pound "£"
                      Prime "″" prime "′" prod "∏" prop "∝" Psi "Ψ" psi "ψ" quot "\""
                      radic "√" rang "〉" raquo "»" rarr "→" rArr "⇒" rceil "⌉" rdquo "”"
                      real "ℜ" reg "®" rfloor "⌋" Rho "Ρ" rho "ρ" rlm "\u200F" rsaquo "›"
                      rsquo "’" sbquo "‚" scaron "š" Scaron "Š" sdot "⋅" sect "§" shy "\u00AD"
                      Sigma "Σ" sigma "σ" sigmaf "ς" sim "∼" spades "♠" sub "⊂" sube "⊆"
                      sum "∑" sup "⊃" sup1 "¹" sup2 "²" sup3 "³" supe "⊇" szlig "ß"
                      Tau "Τ" tau "τ" there4 "∴" Theta "Θ" theta "θ" thetasym "ϑ"
                      thinsp "\u2009" thorn "þ" THORN "Þ" tilde "˜" times "×" trade "™"
                      uacute "ú" Uacute "Ú" uarr "↑" uArr "⇑" ucirc "û" Ucirc "Û"
                      ugrave "ù" Ugrave "Ù" uml "¨" upsih "ϒ" Upsilon "Υ" upsilon "υ"
                      uuml "ü" Uuml "Ü" weierp "℘" Xi "Ξ" xi "ξ" yacute "ý" Yacute "Ý"
                      yen "¥" yuml "ÿ" Yuml "Ÿ" Zeta "Ζ" zeta "ζ" zwj "\u200D" zwnj "\u200C"))
             (get-function
              (lambda (s)
                ;; S is the whole match, e.g. "&amp;"; strip the "&" and ";"
                ;; and look the name up.  Unknown names return S unchanged.
                (or (plist-get plist (intern (substring s 1 -1))) s))))
        ;; FIXEDCASE and LITERAL are t so the replacement is inserted verbatim.
        (replace-regexp-in-string "&[^; ]*;" get-function string t t)))

I wrote the following, which does what you need, @federico-builes. (I needed the same thing.)

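    (defun ajs-decimal-escapes-to-unicode (start end)
      "Convert escapes like '&#955;' to Unicode like 'λ'.
    Operates on the active region or the whole buffer."
      (interactive
       (if (use-region-p)
           (list (region-beginning) (region-end))
         (list (point-min) (point-max))))
      (save-excursion
        ;; Insert at START, so non-interactive calls also replace in place.
        (goto-char start)
        (insert (replace-regexp-in-string
                 ;; "+" rather than "*": an empty "&#;" is not an escape.
                 "&#[0-9]+;"
                 (lambda (match)
                   (format "%c" (string-to-number (substring match 2 -1))))
                 (filter-buffer-substring start end t)
                 ;; FIXEDCASE and LITERAL: without LITERAL, "&#92;" (a
                 ;; backslash) is rejected as an invalid replacement.
                 t t))))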

@Konr's answer was helpful, thanks! I also liked An Introduction to Programming in Emacs Lisp. This is the first piece of Lisp I have written that might be useful. I would appreciate feedback, even on things like whitespace; thanks!
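Since feedback was requested: the same idea can also work on a plain string and accept hexadecimal escapes such as &#x3BB;. What follows is only a minimal sketch; the name my-decode-numeric-entities is made up for illustration and is not part of Emacs, and code points that are not valid characters are not handled.

```elisp
;; Hypothetical helper (not an Emacs built-in): decode decimal and
;; hexadecimal numeric character references in a string.
(defun my-decode-numeric-entities (string)
  "Return STRING with escapes like &#955; and &#x3BB; decoded."
  (replace-regexp-in-string
   "&#\\(?:[xX][0-9a-fA-F]+\\|[0-9]+\\);"
   (lambda (match)
     ;; MATCH is the whole escape, e.g. "&#955;" or "&#x3BB;".
     ;; The character at index 2 tells us whether it is hexadecimal.
     (let ((hex (memq (aref match 2) '(?x ?X))))
       (string (string-to-number (substring match (if hex 3 2) -1)
                                 (if hex 16 10)))))
   string
   ;; FIXEDCASE and LITERAL, so a decoded backslash is inserted as-is.
   t t))

(my-decode-numeric-entities "&#955; and &#x3BB;")
;; => "λ and λ"
```

Working on strings keeps the function usable from other Lisp code; a command that operates on the region can call it on the extracted text.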


I don't know whether there is a built-in function for decoding, but for the opposite direction this small function encodes every character of a string as a numeric entity:

    (defun my-insert-encode-entities-string (str)
      "Return STR with every character written as a decimal entity."
      (mapconcat (lambda (char) (format "&#%d;" char))
                 (string-to-list str)
                 ""))

If you only want to escape the characters that are special in HTML (&, <, > and "), use url-insert-entities-in-string.
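As a short illustration of the difference, url-insert-entities-in-string lives in url-util, which ships with Emacs, and leaves ordinary characters alone. The input string below is just made-up sample text:

```elisp
(require 'url-util)

;; Only &, <, > and " are replaced; everything else is kept as-is.
(url-insert-entities-in-string "a < b & c")
;; => "a &lt; b &amp; c"
```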
