Output of file(1)

pmk@lemmy.sdf.org · 1 year ago

tycho@lemmy.sdf.org · 1 year ago

I explored the source of file(1), and the part that determines the type of a text file seems to be in text.c: https://cvsweb.openbsd.org/cgi-bin/cvsweb/~checkout~/src/usr.bin/file/text.c?rev=1.3&content-type=text/plain

And especially this part:

static int
text_try_test(const void *base, size_t size, int (*f)(u_char))
{
	const u_char	*data = base;
	size_t		 offset;

	for (offset = 0; offset < size; offset++) {
		if (!f(data[offset]))
			return (0);
	}
	return (1);
}

const char *
text_get_type(const void *base, size_t size)
{
	if (text_try_test(base, size, text_is_ascii))
		return ("ASCII");
	if (text_try_test(base, size, text_is_latin1))
		return ("ISO-8859");
	if (text_try_test(base, size, text_is_extended))
		return ("Non-ISO extended-ASCII");
	return (NULL);
}

So file(1) cannot currently tell whether a file is UTF-8. Another file (/etc/magic) can help determine whether a text file is UTF-7 or UTF-8-EBCDIC, because those can be recognized by a BOM, but as you said, UTF-8 does not need a BOM. So it looks like we are stuck here :)
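For what it's worth, here is a rough sketch of what a UTF-8 check could look like. It is not from text.c: text_is_utf8 is a hypothetical name of mine, and I use unsigned char instead of u_char so it compiles on its own. Unlike the per-byte predicates passed to text_try_test, it has to walk the whole buffer, because a UTF-8 sequence spans several bytes. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and anything above U+10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical helper, NOT part of OpenBSD's text.c.
 * Returns 1 if the buffer is well-formed UTF-8 (plain ASCII included),
 * 0 otherwise.
 */
static int
text_is_utf8(const void *base, size_t size)
{
	const unsigned char	*data = base;
	size_t			 offset = 0, i, need;
	unsigned long		 cp, min;
	unsigned char		 c;

	while (offset < size) {
		c = data[offset++];
		if (c < 0x80)
			continue;		/* single-byte (ASCII) */
		if ((c & 0xe0) == 0xc0) {	/* 110xxxxx: 2-byte sequence */
			need = 1; cp = c & 0x1f; min = 0x80;
		} else if ((c & 0xf0) == 0xe0) {	/* 1110xxxx: 3 bytes */
			need = 2; cp = c & 0x0f; min = 0x800;
		} else if ((c & 0xf8) == 0xf0) {	/* 11110xxx: 4 bytes */
			need = 3; cp = c & 0x07; min = 0x10000;
		} else
			return (0);		/* stray continuation or invalid lead */

		if (size - offset < need)
			return (0);		/* sequence cut off by end of buffer */
		for (i = 0; i < need; i++) {
			c = data[offset++];
			if ((c & 0xc0) != 0x80)
				return (0);	/* not a 10xxxxxx continuation byte */
			cp = (cp << 6) | (c & 0x3f);
		}
		/* Reject overlong forms, surrogates and out-of-range values. */
		if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return (0);
	}
	return (1);
}
```

If something like this were wired into text_get_type, I assume it would go after the ASCII test and before the latin1 one, since valid UTF-8 bytes would otherwise be reported as ISO-8859 or extended ASCII. Two caveats: this sketch accepts control characters, which a real check would probably filter out, and a buffer that ends in the middle of a sequence gets rejected.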

z3bra@lemmy.sdf.org · 1 year ago

Which is ironic, given that OpenBSD only supports the UTF-8 encoding :)

tycho@lemmy.sdf.org · 1 year ago

Yes, it looks like UTF-8 is a first-class citizen, but really it is ASCII that is 100% supported. From the FAQ:

The OpenBSD base system fully supports the ASCII character set and encoding, and partially supports the UTF-8 encoding of the Unicode character set.