
MBTOWC(3) Library Functions Manual MBTOWC(3)

mbtowc - converts a multibyte character to a wide character

#include <stdlib.h>

int
mbtowc(wchar_t * restrict pwc, const char * restrict s, size_t n);

The mbtowc() function converts the multibyte character pointed to by s to a wide character, and stores it in the wchar_t object pointed to by pwc. This function may inspect at most n bytes of the array pointed to by s.

Unlike mbrtowc(3), the first n bytes pointed to by s need to form an entire multibyte character. Otherwise, this function returns an error and the internal state will be undefined.

If a call to mbtowc() results in an undefined internal state, parsing of the string starting at s cannot continue, not even at a later byte, and mbtowc() must be called with s set to NULL to reset the internal state before it can safely be used again on a different string.

The behaviour of mbtowc() is affected by the LC_CTYPE category of the current locale. Calling any other function in the standard C library never changes the internal state of mbtowc(), except for calling setlocale(3) with the LC_CTYPE category set to a different locale. Such setlocale(3) calls cause the internal state of this function to be undefined.

In state-dependent encodings such as ISO/IEC 2022-JP, s may point to a special sequence of bytes that changes the shift state. Because such byte sequences do not correspond to any individual wide character, mbtowc() treats them as if they were part of the subsequent multibyte character.

The following special cases apply to the arguments:

s == NULL
mbtowc() initializes its own internal state to the initial state, and determines whether the current encoding is state-dependent. mbtowc() returns 0 if the encoding is state-independent, otherwise non-zero. pwc is ignored.
pwc == NULL
mbtowc() behaves just as if pwc was not NULL, including modifications to internal state, except that the result of the conversion is discarded. This can be used to determine the size of the wide character representation of a multibyte string. Another use case is a check for illegal or incomplete multibyte sequences.
n == 0
In this case, the first n bytes of the array pointed to by s never form a complete character and mbtowc() always fails.
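
The special cases above can be combined to validate and count the characters of a string without storing any wide characters. The following is a minimal sketch only; count_mb_chars() is a hypothetical helper, not part of any library, and it assumes s is a NUL-terminated string in the encoding of the current locale:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the number of multibyte characters in s,
 * or -1 if s contains an invalid or incomplete sequence.
 */
long
count_mb_chars(const char *s)
{
	size_t	 remaining = strlen(s);
	long	 count = 0;
	int	 len;

	/* s == NULL: reset the internal state before starting. */
	(void)mbtowc(NULL, NULL, 0);

	while (remaining > 0) {
		/* pwc == NULL: measure and validate, discard the result. */
		len = mbtowc(NULL, s, remaining);
		if (len == -1) {
			/* The state is now undefined; reset it for later callers. */
			(void)mbtowc(NULL, NULL, 0);
			return -1;
		}
		if (len == 0)
			break;	/* null byte; not expected before the end */
		s += len;
		remaining -= (size_t)len;
		count++;
	}
	return count;
}
```

Passing the number of bytes that remain, rather than MB_CUR_MAX, keeps mbtowc() from inspecting bytes beyond the terminating null byte. The loop never calls mbtowc() with n == 0, because that call always fails.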

Normally, mbtowc() returns:

0
s points to a null byte (‘\0’).
positive
Number of bytes for the valid multibyte character pointed to by s. There are no cases where the value returned is greater than the value of the MB_CUR_MAX macro.
-1
s points to an invalid or an incomplete multibyte character. errno is set to indicate the error.

When s is NULL, mbtowc() returns:

0
The current encoding is state-independent.
non-zero
The current encoding is state-dependent.
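
As an illustration of the s == NULL return values, a program might wrap the query in a small predicate. This is a sketch; encoding_is_state_dependent() is a hypothetical name. Note that the call also resets the internal state as a side effect:

```c
#include <stdlib.h>

/*
 * Hypothetical helper: return non-zero if the encoding of the current
 * LC_CTYPE locale is state-dependent, 0 otherwise.
 * Side effect: the internal state of mbtowc() is reset.
 */
int
encoding_is_state_dependent(void)
{
	return mbtowc(NULL, NULL, 0) != 0;
}
```

The result reflects the locale in effect at the time of the call, so it should be queried again after any setlocale(3) call that changes LC_CTYPE.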

The following program parses a UTF-8 string and reports encoding errors:

#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char	 s[LINE_MAX];
	wchar_t	 wc;
	int	 i, len;

	setlocale(LC_CTYPE, "C.UTF-8");
	if (fgets(s, sizeof(s), stdin) == NULL)
		*s = '\0';
	for (i = 0, len = 1; len != 0; i += len) {
		switch (len = mbtowc(&wc, s + i, MB_CUR_MAX)) {
		case 0:
			printf("byte %d end of string 0x00\n", i);
			break;
		case -1:
			printf("byte %d invalid 0x%0.2hhx\n", i, s[i]);
			len = 1;
			break;
		default:
			printf("byte %d U+%0.4X %lc\n", i, wc, wc);
			break;
		}
	}
	return 0;
}

Recovering from encoding errors and continuing to parse the rest of the string as shown above is only possible for state-independent character encodings. For full generality, the error handling can be modified to reset the internal state. In that case, the rest of the string has to be skipped if the encoding is state-dependent:

		case -1:
			printf("byte %d invalid 0x%0.2hhx\n", i, s[i]);
			len = !mbtowc(NULL, NULL, MB_CUR_MAX);
			break;

mbtowc() will set errno in the following cases:

[EILSEQ]
s points to an invalid or incomplete multibyte character.

mblen(3), mbrtowc(3), setlocale(3)

The mbtowc() function conforms to ANSI X3.159-1989 (“ANSI C89”). The restrict qualifier was added in ISO/IEC 9899:1999 (“ISO C99”). Setting errno is an IEEE Std 1003.1-2008 (“POSIX.1”) extension.

On error, callers of mbtowc() cannot tell whether the multibyte character was invalid or incomplete. To treat incomplete data differently from invalid data the mbrtowc(3) function can be used instead.
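
A minimal sketch of that alternative follows. The enum and the classify_mb() function are hypothetical names chosen for illustration; the sketch relies only on the standard return values of mbrtowc(3), where (size_t)-1 indicates an invalid sequence and (size_t)-2 an incomplete one:

```c
#include <string.h>
#include <wchar.h>

enum seq_status { SEQ_COMPLETE, SEQ_END, SEQ_INVALID, SEQ_INCOMPLETE };

/*
 * Hypothetical helper: classify the first multibyte character in the
 * n bytes pointed to by s, starting from the initial conversion state.
 */
enum seq_status
classify_mb(const char *s, size_t n)
{
	mbstate_t	 st;
	wchar_t		 wc;
	size_t		 r;

	memset(&st, 0, sizeof(st));	/* initial conversion state */
	r = mbrtowc(&wc, s, n, &st);
	if (r == (size_t)-1)
		return SEQ_INVALID;	/* errno is set to EILSEQ */
	if (r == (size_t)-2)
		return SEQ_INCOMPLETE;	/* more input bytes are needed */
	if (r == 0)
		return SEQ_END;		/* s points to a null byte */
	return SEQ_COMPLETE;
}
```

Because the conversion state lives in a caller-supplied mbstate_t object, this approach does not touch the internal state of mbtowc().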

November 11, 2023 current