Valid content type for XML, HTML, and XHTML documents

What are the right content types for XML, HTML, and XHTML documents?

I need to write a simple crawler that extracts only these files.

Currently, http://example.net/index.html can be served as, for example, a JPEG file because of mod_rewrite, so I need to read the content type from the response header and compare it against a list of allowed content types.

Where can I get such a list?

+74
html xml xhtml web-standards
Jun 03 2018-10-06T00:
1 answer

HTML: text/html, full stop.

XHTML: application/xhtml+xml, or text/html, but only if you follow the HTML compatibility guidelines. See the W3C Media Types Note.

XML: text/xml, application/xml (RFC 2376).

There are also many other XML-based media types, such as application/rss+xml or image/svg+xml. It is a safe bet that any unrecognised but registered type ending in +xml is XML-based. See the IANA list for registered media types ending in +xml.

(For unregistered x- types, all bets are off, but you would hope that +xml is respected.)
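For the crawler, the list above could be turned into a check along these lines. This is only a minimal sketch: the names ALLOWED_TYPES and is_markup_type are made up for illustration, and treating every +xml suffix as XML is the heuristic described above, not a guarantee.

```python
# Media types named in the answer above (hypothetical constant name).
ALLOWED_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/xml",
    "application/xml",
}


def media_type(content_type_header):
    """Return the bare type/subtype, lowercased, without parameters.

    "Text/HTML; charset=utf-8" becomes "text/html".
    """
    return content_type_header.split(";", 1)[0].strip().lower()


def is_markup_type(content_type_header, accept_xml_suffix=True):
    """Decide whether a Content-Type header value names HTML, XHTML, or XML.

    With accept_xml_suffix=True, any subtype ending in "+xml"
    (for example image/svg+xml) is also accepted as XML-based.
    """
    mt = media_type(content_type_header)
    if mt in ALLOWED_TYPES:
        return True
    return accept_xml_suffix and mt.endswith("+xml")
```

With urllib.request, the header value would typically come from response.headers.get("Content-Type", ""). Pass accept_xml_suffix=False if feeds, SVG images, and similar +xml types should be skipped.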

+129
Jun 03 '10 at 12:01