Concept of the mb_substr() Function
The mb_substr() function slices a string starting from a specified position (offset) and for a specified length, returning the extracted substring while safely handling multibyte characters.
While mb_substr() performs almost the same role as substr(), its key advantage is that it safely handles multibyte strings, such as those encoded in UTF-8, without corrupting characters.
Encoding Issues with the substr() Function
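For example, consider the Japanese string 'こんにちは世界' ("Hello, World" in Japanese):
// Assuming UTF-8, each of these characters is 3 bytes, so 5 bytes is こ plus a truncated ん
echo substr('こんにちは世界', 0, 5);
substr() operates on byte offsets, so it can split a multibyte character in the middle and return an invalid byte sequence.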
Fixing Encoding Issues with the mb_substr() Function
Use mb_substr() to correctly handle multibyte strings:
echo mb_substr('こんにちは世界', 0, 5);
// Output: こんにちは
When working with strings in UTF-8 or other multibyte encodings, characters such as Japanese, Korean, or Chinese use multiple bytes per character, unlike English letters or digits, which use only one byte in UTF-8.
Because a single character may occupy two or more bytes and substr() cuts strings at the byte level, slicing within a multibyte character can cause corrupted or unreadable output.
The mb_substr() function solves this problem by operating on character counts rather than byte offsets, ensuring safe and accurate substring extraction.
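The difference between byte offsets and character counts can be checked directly. The following is a minimal sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Assumption: this file is UTF-8 encoded and mbstring is available.
$text = 'こんにちは世界';

$byteLength = strlen($text);              // number of bytes
$charLength = mb_strlen($text, 'UTF-8');  // number of characters

$byteSlice = substr($text, 0, 5);             // first 5 bytes
$charSlice = mb_substr($text, 0, 5, 'UTF-8'); // first 5 characters

echo $byteLength, PHP_EOL; // 21 (7 characters x 3 bytes each)
echo $charLength, PHP_EOL; // 7

// The byte slice ends inside the second character, so it is not valid UTF-8.
var_dump(mb_check_encoding($byteSlice, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charSlice, 'UTF-8')); // bool(true)
```

Passing 'UTF-8' explicitly keeps the result independent of the configured internal encoding.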
Note:
The mb_substr() function is a safe, multibyte-aware alternative to substr().
Syntax
mb_substr(
string $string,
int $start,
?int $length = null,
?string $encoding = null
): string
Parameters
Parameter | Description |
---|---|
$string | Required. The original string from which the substring will be extracted. |
$start | Required. The starting position for extraction. The index starts at 0, meaning the first character of the string is at index 0. If a negative value is used, it counts backward from the end of the string. For example, -1 refers to the last character, and -2 refers to the second-to-last character. |
$length | Optional. The length of the substring to extract. The default value is null, which means extracting all characters from the start position to the end of the string. |
$encoding | Optional. The character encoding. If it is omitted or null, the internal character encoding value will be used. |
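The parameters described above can be combined as in the following sketch. It assumes a UTF-8 source file with mbstring loaded, and it passes the encoding explicitly rather than relying on the internal default; the variable names are illustrative.

```php
<?php
// Assumption: this file is UTF-8 encoded and mbstring is available.
$s = 'こんにちは世界';

// $start and $length both given: 3 characters starting at index 2.
$middle = mb_substr($s, 2, 3, 'UTF-8');

// $length = null: everything from index 5 to the end of the string.
$tail = mb_substr($s, 5, null, 'UTF-8');

// Negative $start: begin 2 characters before the end.
$lastTwo = mb_substr($s, -2, null, 'UTF-8');

echo $middle, PHP_EOL;  // にちは
echo $tail, PHP_EOL;    // 世界
echo $lastTwo, PHP_EOL; // 世界
```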
Return Values
The mb_substr() function returns the extracted substring when the operation is successful.
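The return value is always a string, including in edge cases. The sketch below illustrates two of them under the same assumptions as before (UTF-8 source, mbstring loaded): a start position past the end of the string and a length larger than the remaining characters.

```php
<?php
// Assumption: this file is UTF-8 encoded and mbstring is available.
$s = 'こんにちは';

// Start position beyond the end: the result is an empty string.
$pastEnd = mb_substr($s, 10, 2, 'UTF-8');

// Length longer than what remains: the result stops at the end of the string.
$clamped = mb_substr($s, 3, 100, 'UTF-8');

var_dump($pastEnd); // string(0) ""
echo $clamped, PHP_EOL; // ちは
```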
Changelog
Version | Description |
---|---|
8.0.0 | The $encoding parameter can now be set to null. When it is null or omitted, the function uses the internal character encoding. |
Practical Examples
The following examples demonstrate how the mb_substr() function behaves in various scenarios, including basic usage with multibyte characters and working with negative offset values.
Basic Usage
$originalString = 'こんにちは、はじめまして!';
$start = 3; // Starting from the 4th character (0-based index)
$length = 5; // Extract 5 characters
$extractedString = mb_substr($originalString, $start, $length);
echo 'Extracted substring: ' . $extractedString;
// Output: Extracted substring: ちは、はじ
In this example, mb_substr() extracts 5 characters starting from the 4th character of a Japanese string. Since it counts characters (not bytes), it preserves the integrity of multibyte characters like Japanese kana and punctuation.
Using Negative Values for the $start Parameter: Counting from the End of the String
When the $start parameter is a negative number, mb_substr() starts counting from the end of the string. For example, -1 refers to the last character, -2 to the second-to-last, and so on.
$originalString = 'こんにちは、はじめまして!';
$start = -6; // Start 6 characters from the end
$length = 4; // Extract 4 characters
$extractedString = mb_substr($originalString, $start, $length);
echo 'Extracted substring: ' . $extractedString;
// Output: Extracted substring: じめまし
Here, $start is set to -6, which tells mb_substr() to start from the sixth character from the end of the string. It then extracts four characters, resulting in a clean, valid substring with no broken multibyte characters.
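A negative $start combined with an omitted $length gives a simple way to take the last few characters of a string. The helper below, lastChars(), is a hypothetical name introduced only for illustration; it is a sketch that assumes UTF-8 input and a loaded mbstring extension.

```php
<?php
// Hypothetical helper (not part of PHP): returns the last $count characters.
// Assumption: $text is UTF-8 encoded and mbstring is available.
function lastChars(string $text, int $count): string
{
    if ($count <= 0) {
        return ''; // nothing requested
    }
    // Negative start counts from the end; null length reads to the end.
    return mb_substr($text, -$count, null, 'UTF-8');
}

$ending = lastChars('こんにちは、はじめまして!', 3);
echo $ending, PHP_EOL; // して!
```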