Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding #448

sammo3182 · 2021-07-24T00:16:23Z

stri_detect_regex does not seem to recognize Chinese characters correctly when they are used as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:

Sys.setlocale(, "Chinese")
library(stringi)

stri_detect_fixed("昌平区", "县") # Works fine
#> [1] FALSE
stri_detect_regex("昌平区", "县") # TRUE
#> [1] TRUE
grepl("县", "昌平区") # FALSE
#> [1] FALSE

Another example:

library(dplyr)
library(rvest)
library(stringi)

link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>%
  html_nodes("p") %>%
  html_text

stri_detect_regex(tx_xi, "同志们")  #Note that these are the very first three characters of the speech

#> [1] FALSE

sessionInfo()
#> R Under development (unstable) (2021-05-17 r80314)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#>  [1] LC_COLLATE=Chinese (Simplified)_China.936 
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> system code page: 65001
#>
#> attached base packages:
#>  [1] stats     graphics  grDevices utils     datasets  methods  
#> [7] base     
#>
#> other attached packages:
#>   [1] stringi_1.7.3
#>
#> loaded via a namespace (and not attached):
#>   [1] compiler_4.2.0 tools_4.2.0    parallel_4.2.0

The issue was submitted to stringr (tidyverse/stringr#386 (comment)), but it looks like a stringi problem?

gagolews · 2021-07-24T00:44:06Z

I cannot reproduce the above; I get:

>  library("stringi")
> stri_detect_regex("昌平区", "县")
[1] FALSE
> stri_detect_fixed("昌平区", "县")
[1] FALSE
> grepl("县", "昌平区") 
[1] FALSE
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.7.3

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0   
>

What does stri_escape_unicode() return on your platform when run on both strings (pattern, search string)? How about charToRaw()? How about utf8ToInt()?
Can you try with a more recent version of the stringi package?

gagolews · 2021-07-24T00:44:46Z

Also, could you please show me the result of a call to stri_info(FALSE)?

gagolews · 2021-07-24T00:47:43Z

With the latter, I get:

stri_detect_regex(tx_xi, "同志们") 
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
[69] FALSE FALSE
> tx_xi[1]
[1] "在庆祝中国共产党成立100周年大会上的讲话"

sammo3182 · 2021-07-24T04:10:58Z

Marek, first, thank you so much for helping me with this!!
One reason you could not reproduce my result may be that you did not switch the locale to Chinese with Sys.setlocale, as I did in the first line of the example. This matters: without it, much of the Chinese output would just be returned as hex Unicode or UTF-8 codes. (Yihui has talked about this in many places.)

Per your questions, here is what I got:

> stri_escape_unicode("昌平区")
Error in stri_escape_unicode("昌平区") : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode("县")
Error in stri_escape_unicode("县") : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> 
> # As the error message suggested, I did the following
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> ?stri_enc_toutf8
> # As the error message suggested, I did the following
> stri_enc_toutf8("昌平区")
[1] "昌平区"
> stri_enc_toutf8("县")
[1] "县"
> 
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode(stri_enc_toutf8("县"))
Error in stri_escape_unicode(stri_enc_toutf8("县")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> 
> 
> charToRaw("昌平区")
[1] b2 fd c6 bd c7 f8
> charToRaw("县")
[1] cf d8
> 
> utf8ToInt("昌平区")
[1] NA
> utf8ToInt("县")
[1] NA

> stri_info(FALSE)
$Unicode.version
[1] "13.0"

$ICU.version
[1] "69.1"

$Locale
$Locale$Language
[1] "en"

$Locale$Country
[1] "US"

$Locale$Variant
[1] ""

$Locale$Name
[1] "en_US"


$Charset.internal
[1] "UTF-8"  "UTF-16"

$Charset.native
$Charset.native$Name.friendly
[1] "UTF-8"

$Charset.native$Name.ICU
[1] "UTF-8"

$Charset.native$Name.UTR22
[1] NA

$Charset.native$Name.IBM
[1] "ibm-1208"

$Charset.native$Name.WINDOWS
[1] "windows-65001"

$Charset.native$Name.JAVA
[1] "UTF-8"

$Charset.native$Name.IANA
[1] "UTF-8"

$Charset.native$Name.MIME
[1] "UTF-8"

$Charset.native$ASCII.subset
[1] TRUE

$Charset.native$Unicode.1to1
[1] NA

$Charset.native$CharSize.8bit
[1] FALSE

$Charset.native$CharSize.min
[1] 1

$Charset.native$CharSize.max
[1] 3


$ICU.system
[1] FALSE

$ICU.UTF8
[1] FALSE

>

Do the last couple of lines indicate anything?

sammo3182 · 2021-07-24T04:17:49Z

Sorry for the confusion. My bad for the miscoding. The problem remains, though. Try this:

library(dplyr)
library(rvest)
library(stringi)

link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>%
  html_nodes("p") %>%
  html_text

tx_xi[6]
#> [1] "同志们，朋友们："
stri_detect_regex(tx_xi[6], "同志们")  # Note that these are the very first three characters of the speech
#> [1] FALSE

gagolews · 2021-07-24T07:04:33Z

I think the problem is due to:

[2] LC_CTYPE=Chinese (Simplified)_China.936   
...
system code page: 65001

ICU thinks your native encoding is UTF-8, whereas it's probably GBK.

Could you give stri_enc_set("Windows-936") a try?
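
To illustrate the mismatch described above: the bytes that charToRaw() reported earlier in this thread are not well-formed UTF-8, which is consistent with the strings being stored in GBK while ICU assumes UTF-8. A minimal base-R sketch (the variable names are illustrative, and validUTF8() needs R 3.3.0 or later):

```r
# Bytes copied from the charToRaw() output earlier in this thread.
hay_bytes <- as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8))  # search string as stored
pat_bytes <- as.raw(c(0xcf, 0xd8))                          # pattern as stored

# Neither byte sequence is well-formed UTF-8, so a library that assumes
# UTF-8 input cannot decode them as intended.
hay_is_utf8 <- validUTF8(rawToChar(hay_bytes))
pat_is_utf8 <- validUTF8(rawToChar(pat_bytes))
print(c(hay_is_utf8, pat_is_utf8))

# What R itself believes about the session locale; on Windows the list
# also carries a `codepage` entry.
str(l10n_info())
```

If both checks come back FALSE on an affected machine, that would support the idea that the input is in the locale's code page rather than in UTF-8.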

sammo3182 · 2021-07-26T00:24:46Z

My, it works! It looks like the error is indeed due to ICU's encoding detection. Once Windows-936 is set, both of the above cases work well! Thank you so much, Marek, for helping me with this issue! I'm not sure whether this issue only affects recognizing Chinese on a PC, but I bet many text analysts would appreciate knowing about it and the solution above!

gagolews · 2021-07-26T00:28:42Z

Great, I changed the title of the issue so that it's more searchable.

To sum up, the solution was:

stri_enc_set("Windows-936")
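
Since the call has to be repeated in every session, one option is to apply it at startup only when it is needed. The following is a rough sketch, not a stringi feature: sync_stringi_encoding() is a hypothetical helper, it assumes that l10n_info()$codepage reports 936 on the affected machines, and the comparison against the friendly name "GBK" is an assumption about what stri_info() reports once the encoding has been set.

```r
# Hypothetical helper (not part of stringi): set stringi's default encoding
# to Windows-936 when R's locale uses code page 936 but stringi does not.
sync_stringi_encoding <- function() {
  cp <- l10n_info()$codepage  # only present on Windows, NULL elsewhere
  if (is.null(cp) || !identical(as.integer(cp), 936L)) {
    return(FALSE)             # nothing to do on this platform or locale
  }
  native <- stringi::stri_info()$Charset.native$Name.friendly
  if (identical(native, "GBK")) {
    return(FALSE)             # assumed to mean the encoding is already set
  }
  stringi::stri_enc_set("Windows-936")
  TRUE
}

changed <- sync_stringi_encoding()
print(changed)
```

Calling this at the top of a script, or from a project's .Rprofile, would avoid having to remember the manual call.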

sammo3182 · 2021-07-26T00:39:42Z

A quick follow-up question: is there any tradeoff in changing the stringi encoding? Or is there a way to let stringi recognize Chinese characters in UTF-8 as UTF-8? The encoding converter does not seem to make any difference at all without stri_enc_set:

# stri_enc_set has not been called
stri_detect_regex(stri_conv("昌平区", to = "UTF8"), stri_conv("县", to = "UTF8")) 
#> [1] TRUE
# The correct outcome should be FALSE, since "县" isn't in "昌平区"

gagolews · 2021-07-26T00:46:49Z

I get FALSE. I think the problem might just as well be on your system's side, not just stringi's, but it's worth digging into.

Can you call:

charToRaw(stri_conv("昌平区", to = "UTF8"))
charToRaw(stri_conv("县", to = "UTF8"))
charToRaw("昌平区")
charToRaw("县")
stri_enc_mark("昌平区")
stri_enc_mark("县")

Also, try iconv instead of stri_conv

gagolews · 2021-07-26T00:47:44Z

Also, maybe the most recent R - UCRT is worth giving a try? https://github.com/r-windows/docs/blob/master/ucrt.md

sammo3182 · 2021-07-26T04:12:25Z

iconv works. The PC system is definitely a major part of the cause of this issue. Nevertheless, I guess my setup is representative of the system environment of most R users in China. In that case, either stri_enc_set or iconv would work. Of course, if stringi could offer an argument to do so automatically, that would be great, ha-ha!

Regarding UCRT, it is definitely intriguing, but it looks like it is only about building packages? I didn't see any instructions showing how I can have Windows automatically convert everything to UTF-8 at the input stage. If there is no such way, UCRT won't be that different from manually converting to UTF-8 with iconv, no?

charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] ef bf bd ef bf bd c6 bd ef bf bd ef bf bd
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] ef bf bd ef bf bd
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"

stri_detect_regex(iconv("昌平区", to = "UTF8"), "县") # supposed to be FALSE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), "县") # supposed to be TRUE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), iconv("县", to = "UTF8")) # supposed to be TRUE
#> [1] TRUE
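
One reading of the three results above is that stringi only sees correct UTF-8 for the arguments that went through iconv(), so the second call fails because its pattern is still in the native encoding. A hedged base-R sketch of converting both sides explicitly follows; to_utf8() is a hypothetical helper, from = "GBK" is an assumption about the input, and grepl() stands in for the stringi call so that the example does not depend on stringi's default encoding.

```r
# GBK bytes taken from the charToRaw() output in this thread.
gbk_district <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))  # ends in "区"
gbk_county   <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xcf, 0xd8)))  # ends in "县"
gbk_pattern  <- rawToChar(as.raw(c(0xcf, 0xd8)))                          # "县"

# Hypothetical helper: name the source encoding instead of relying on
# whatever the session considers "native".
to_utf8 <- function(x, from = "GBK") iconv(x, from = from, to = "UTF-8")

district <- to_utf8(gbk_district)
county   <- to_utf8(gbk_county)
pattern  <- to_utf8(gbk_pattern)

# Convert both the text and the pattern before matching.
in_district <- grepl(pattern, district, fixed = TRUE)
in_county   <- grepl(pattern, county, fixed = TRUE)
print(c(in_district, in_county))
```

The point of the sketch is only that the pattern needs the same conversion as the text; converting one side alone leaves the two in different encodings.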

gagolews · 2021-07-26T04:28:01Z

Hmmm... are these really generated with stri_enc_set("Windows-936") in place? This needs to be called each time the package is loaded.

The byte sequence ef bf bd denotes the replacement character ("unknown") btw
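
To make this concrete, here is a small base-R sketch of why the regex call may have returned TRUE without stri_enc_set. It is an interpretation of the posted bytes, not a confirmed account of ICU internals: when the GBK bytes are decoded as UTF-8, nearly every byte becomes U+FFFD, so the pattern turns into two replacement characters and the text happens to contain two in a row. The byte vectors are copied from the output above; the variable names are illustrative.

```r
# UTF-8 bytes that stri_conv() produced without stri_enc_set(), as posted above.
hay <- rawToChar(as.raw(c(0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd, 0xc6, 0xbd,
                          0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd)))
pat <- rawToChar(as.raw(c(0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd)))
Encoding(hay) <- "UTF-8"
Encoding(pat) <- "UTF-8"

# ef bf bd is the UTF-8 encoding of U+FFFD (65533), the replacement character.
replacement <- charToRaw(intToUtf8(0xFFFD))

# Code points actually present in the two converted strings.
hay_points <- utf8ToInt(hay)
pat_points <- utf8ToInt(pat)

# The mangled pattern is a substring of the mangled text.
spurious_match <- grepl(pat, hay, fixed = TRUE, useBytes = TRUE)

print(replacement)
print(hay_points)
print(pat_points)
print(spurious_match)
```

Only the bytes c6 bd survive, because they happen to form a valid two-byte UTF-8 sequence (U+01BD); everything else is replaced, which is how two unrelated strings end up matching.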

sammo3182 · 2021-07-26T04:35:13Z

Oh, I might have misled you! The above outputs were produced without calling stri_enc_set. As asked in #448 (comment), I was looking for a solution that does not require calling stri_enc_set again each time. Everything works fine when the encoding is set manually:

library(stringi)
stri_enc_set("Windows-936")
#> New settings: stringi_1.7.3 (en_US.GBK; ICU4C 69.1 [bundle]; Unicode 13.0)
#> Warning message:
#> In stri_info(short = TRUE) :
#>   Your native charset does not map to Unicode well. This may cause serious problems. Consider switching to UTF-8.
charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] e6 98 8c e5 b9 b3 e5 8c ba
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] e5 8e bf
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"

gagolews · 2021-07-26T05:06:12Z

:)

Dear all, has anyone working in this locale experienced similar issues?

gagolews changed the title ~~Problem of detecting Chinese characters~~ Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding Jul 26, 2021

gagolews closed this as completed Jul 26, 2021

gagolews reopened this Jul 26, 2021
