feat(rust): support simd approach in converting utf16 to utf8 #1778
base: main
This is too large. Do we have a better way to implement it? @theweipeng @kitty-eu-org
let current_dir = env::current_dir()
    .expect("Failed to get current directory")
    .join("benches");
let path1 = current_dir.join("chinese.utf8.txt");
It seems this file is only used here. Could we use smaller test data? I think two or three lines of string literals are enough for the benchmark.
OK, I'll replace it with some randomly generated string
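For reference, a minimal sketch of building the benchmark input from a few literals instead of a data file. The sample strings and the helper name `sample_utf16_inputs` are placeholders made up for illustration, not something from this PR.

```rust
/// Encodes a few short literals as UTF-16 code units (host byte order)
/// for use as benchmark input. The literals are illustrative placeholders.
fn sample_utf16_inputs() -> Vec<Vec<u16>> {
    let samples = ["hello, fury", "你好，世界", "mixed: abc 中文 😀"];
    samples
        .iter()
        .map(|s| s.encode_utf16().collect())
        .collect()
}
```

Each `Vec<u16>` can be passed to the conversion function under test, and `String::from_utf16` can serve as the baseline for checking the result.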
I have thought about that too. Actually, I have seen the C++ implementation in fury: https://github.com/apache/fury/pull/1732/files , but there the SIMD API seems to be used only for swapping endianness and
rust version
It is easy to shorten the code: just remove some functions. I just wrote a short version:
#[target_feature(enable = "avx", enable = "avx2", enable = "sse2")]
pub unsafe fn utf16_to_utf8_only_check(
    utf16: &[u16],
    is_little_endian: bool,
) -> Result<String, String> {
    // Worst case without surrogates is 3 UTF-8 bytes per UTF-16 unit.
    let mut utf8_bytes: Vec<u8> = Vec::with_capacity(utf16.len() * 3);
    let ptr_8 = utf8_bytes.as_mut_ptr();
    let ptr_16 = utf16.as_ptr();
    let mut offset_8 = 0;
    let mut offset_16 = 0;
    let len_16 = utf16.len();
    while offset_16 + CHUNK_UTF16_USAGE <= len_16 {
        let mut chunk = _mm256_loadu_si256(ptr_16.add(offset_16) as *const __m256i);
        // Swap bytes within each u16 when the input byte order differs from the host.
        chunk = if is_little_endian == super::super::IS_LITTLE_ENDIAN_LOCAL {
            chunk
        } else {
            _mm256_shuffle_epi8(chunk, *ENDIAN_SWAP_MASK)
        };
        // Note: _mm256_cmpgt_epi16 compares lanes as signed i16.
        let mask1 = _mm256_cmpgt_epi16(chunk, *limit1);
        // Check that every u16 in the chunk is less than 0x80 (1 utf16 -> 1 utf8).
        if _mm256_testz_si256(mask1, mask1) != 0 {
            let utf8_packed = _mm_packus_epi16(
                _mm256_castsi256_si128(chunk),
                _mm256_extractf128_si256(chunk, 1),
            );
            _mm_storeu_si128(ptr_8.add(offset_8) as *mut __m128i, utf8_packed);
            offset_8 += CHUNK_UTF16_USAGE;
            offset_16 += CHUNK_UTF16_USAGE;
            continue;
        }
        // Some utf16 units in the chunk convert to 2/3/4 utf8 bytes.
        let res = call_fallback(
            ptr_16,
            ptr_8,
            &mut offset_16,
            &mut offset_8,
            len_16,
            is_little_endian,
        );
        if let Some(err_msg) = res {
            return Err(err_msg);
        }
    }
    // Handle the remaining u16 units that are not enough to form a chunk.
    if offset_16 < len_16 {
        let suffix_utf16 =
            std::slice::from_raw_parts(ptr_16.add(offset_16), len_16 - offset_16);
        let written = super::super::utf16_to_utf8_fallback(
            suffix_utf16,
            ptr_8.add(offset_8),
            is_little_endian,
        )?;
        offset_8 += written;
    }
    utf8_bytes.set_len(offset_8);
    Ok(String::from_utf8(utf8_bytes).unwrap())
}
By the way, I fully support shortening the code. This huge code is too ugly for fury. 😄
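To make the non-SIMD path concrete, here is a rough sketch of what a scalar fallback could look like. It is not the PR's `utf16_to_utf8_fallback` (that one writes through a raw pointer and returns the byte count); the name `utf16_to_utf8_scalar` and its signature are assumptions for illustration only.

```rust
/// Hypothetical scalar fallback: converts UTF-16 code units to a UTF-8 String.
/// `is_little_endian` is the byte order of the input units; each unit is
/// byte-swapped when that differs from the host byte order.
fn utf16_to_utf8_scalar(utf16: &[u16], is_little_endian: bool) -> Result<String, String> {
    let swap = is_little_endian != cfg!(target_endian = "little");
    // 3 bytes per unit is enough: a surrogate pair (2 units) needs only 4 bytes.
    let mut out: Vec<u8> = Vec::with_capacity(utf16.len() * 3);
    let mut i = 0;
    while i < utf16.len() {
        let unit = if swap { utf16[i].swap_bytes() } else { utf16[i] };
        i += 1;
        let cp: u32 = match unit {
            // High surrogate: must be followed by a low surrogate.
            0xD800..=0xDBFF => {
                let low = match utf16.get(i) {
                    Some(&l) => if swap { l.swap_bytes() } else { l },
                    None => return Err("unpaired high surrogate at end of input".to_string()),
                };
                if !(0xDC00..=0xDFFF).contains(&low) {
                    return Err(format!("invalid low surrogate at index {}", i));
                }
                i += 1;
                0x10000 + (((unit as u32 - 0xD800) << 10) | (low as u32 - 0xDC00))
            }
            // A low surrogate without a preceding high surrogate is invalid.
            0xDC00..=0xDFFF => {
                return Err(format!("unexpected low surrogate at index {}", i - 1));
            }
            _ => unit as u32,
        };
        match cp {
            0..=0x7F => out.push(cp as u8),
            0x80..=0x7FF => {
                out.push(0xC0 | (cp >> 6) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
            0x800..=0xFFFF => {
                out.push(0xE0 | (cp >> 12) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
            _ => {
                out.push(0xF0 | (cp >> 18) as u8);
                out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
        }
    }
    String::from_utf8(out).map_err(|e| e.to_string())
}
```

A function like this is also handy as a reference implementation when testing the SIMD paths, since both should produce identical output for the same input.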
@urlyy I wrote a Rust version; with NEON the performance is about twice as good. We can optimize it together. My "Standard" is your "normal".
@urlyy I'm optimizing the SIMD code for SSE2 and AVX.
@urlyy For NEON I referred to part of your implementation, thank you.
I'm new to SIMD, and my implementation is referenced from
What does this PR do?
For the conversion from UTF-16 to UTF-8, this PR adds a SIMD method based on the AVX/SSE/NEON instruction sets on top of #1730, along with benchmarks.
Notice:
- The implementation references simdutf, but it takes 1600 lines.
- util.rs might need to be merged with string_util.rs.
- The util.rs code might be too long for you. The algorithm first splits the UTF-16 input into chunks, then converts each 256/128-bit chunk to UTF-8 bytes at one time, without a loop (except case 4), handling 4 cases.
Example:
First we use the bitwise operations included in SIMD to convert 0x[ 00ab pqrs ] -> 0x[ 00ab qwer ]. Assume that 00ab is a case of 1 utf16 -> 1 utf8, and pqrs is a case of 1 utf16 -> 2 utf8.
Then we should remove the unneeded 00. With shuffle: array = table[idx][1...] = [1,2,3,0] (not corresponding to real data), we convert 0x[ 00ab qwer ] --simd_shuffle_func--> 0x[abqw erxx]. And since we get final_length = table[idx][0] = 3, we can set len = len + 3, not 4. Although we actually store 0x[abqw erxx], the pre-allocation ensures there is no index out of bounds.
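The table-driven compaction in this example can be emulated in scalar Rust. The sketch below is only an illustration under assumptions: `shuffle4` stands in for a SIMD byte shuffle, the single table row `[3, 1, 2, 3, 0]` is the one from the example, and the chunk bytes are the expanded form of `a` (1 byte) followed by `é` (2 bytes, `C3 A9`). None of these helper names come from the PR.

```rust
/// Scalar stand-in for a SIMD byte shuffle: output[i] = src[indices[i]].
fn shuffle4(src: [u8; 4], indices: &[u8]) -> [u8; 4] {
    let mut dst = [0u8; 4];
    for (d, &idx) in dst.iter_mut().zip(indices) {
        *d = src[idx as usize];
    }
    dst
}

/// Stores the whole shuffled chunk, but advances `offset` only by row[0],
/// the number of useful bytes. The next store overwrites the padding byte.
fn store_compacted(buf: &mut [u8], offset: &mut usize, chunk: [u8; 4], row: &[u8; 5]) {
    let shuffled = shuffle4(chunk, &row[1..]);
    buf[*offset..*offset + 4].copy_from_slice(&shuffled);
    *offset += row[0] as usize;
}

/// Runs the example: [00, 'a', C3, A9] -> ['a', C3, A9, xx], length 3.
fn compact_example() -> String {
    // row[0] = final length, row[1..] = shuffle indices (hypothetical table row).
    let row = [3, 1, 2, 3, 0];
    let chunk = [0x00, 0x61, 0xC3, 0xA9];
    // Pre-allocated buffer, so storing 4 bytes never goes out of bounds.
    let mut buf = vec![0u8; 8];
    let mut offset = 0;
    store_compacted(&mut buf, &mut offset, chunk, &row);
    buf.truncate(offset);
    String::from_utf8(buf).expect("valid UTF-8")
}
```

The key point is that the store is always full width while the length only grows by `row[0]`, which is why the output buffer has to be over-allocated up front.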
Related issues
Does this PR introduce any user-facing change?
Benchmark
dataset from https://github.com/lemire/unicode_lipsum/tree/main/wikipedia_mars
Both the SIMD and non-SIMD approaches are faster than using String::from_utf16(bytes).
In the benchmark on my Win11 x86 machine, the SIMD approach seems to be only a little faster than the normal approach, which is below my expectation. AVX seems better than SSE because AVX handles 256 bits at a time while SSE only handles 128 bits at a time. When handling surrogate pairs, the algorithm uses the fallback (normal, non-SIMD) path; in this case the SIMD approach might be worse than the normal approach.
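One way to reason about these numbers is to check how often a dataset can take a SIMD fast path at all. The sketch below counts, for 16-unit chunks (one 256-bit register), how many are pure ASCII; this mirrors only the ASCII check of the short version discussed in the conversation, and both the helper name `ascii_chunk_ratio` and the assumption of host byte order input are mine.

```rust
/// Number of u16 lanes in one 256-bit register.
const CHUNK: usize = 16;

/// Returns (ascii_chunks, total_chunks) over the full 16-unit chunks of the
/// input. Units are assumed to be in host byte order; a trailing partial
/// chunk is ignored because it would go to the scalar path anyway.
fn ascii_chunk_ratio(utf16: &[u16]) -> (usize, usize) {
    let mut ascii = 0;
    let mut total = 0;
    for chunk in utf16.chunks_exact(CHUNK) {
        total += 1;
        if chunk.iter().all(|&u| u < 0x80) {
            ascii += 1;
        }
    }
    (ascii, total)
}
```

If most chunks of a dataset are not pure ASCII, an implementation that only accelerates ASCII chunks spends most of its time in the fallback, so a small speedup there would not be surprising.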