Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding #448
Comments
I cannot reproduce the above; I get:
> library("stringi")
> stri_detect_regex("昌平区", "县")
[1] FALSE
> stri_detect_fixed("昌平区", "县")
[1] FALSE
> grepl("县", "昌平区")
[1] FALSE
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.7.3
loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0
>
|
Also, could you please show me the result of a call to |
With the latter, I get:
|
Marek, first, thank you so much for helping me with this! To answer your questions, here is what I got:
> stri_escape_unicode("昌平区")
Error in stri_escape_unicode("昌平区") :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode("县")
Error in stri_escape_unicode("县") :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
>
> # Following the error message, I did the following
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> ?stri_enc_toutf8
> # Following the error message, I did the following
> stri_enc_toutf8("昌平区")
[1] "昌平区"
> stri_enc_toutf8("县")
[1] "县"
>
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode(stri_enc_toutf8("县"))
Error in stri_escape_unicode(stri_enc_toutf8("县")) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
>
>
> charToRaw("昌平区")
[1] b2 fd c6 bd c7 f8
> charToRaw("县")
[1] cf d8
>
> utf8ToInt("昌平区")
[1] NA
> utf8ToInt("县")
[1] NA
> stri_info(FALSE)
$Unicode.version
[1] "13.0"
$ICU.version
[1] "69.1"
$Locale
$Locale$Language
[1] "en"
$Locale$Country
[1] "US"
$Locale$Variant
[1] ""
$Locale$Name
[1] "en_US"
$Charset.internal
[1] "UTF-8" "UTF-16"
$Charset.native
$Charset.native$Name.friendly
[1] "UTF-8"
$Charset.native$Name.ICU
[1] "UTF-8"
$Charset.native$Name.UTR22
[1] NA
$Charset.native$Name.IBM
[1] "ibm-1208"
$Charset.native$Name.WINDOWS
[1] "windows-65001"
$Charset.native$Name.JAVA
[1] "UTF-8"
$Charset.native$Name.IANA
[1] "UTF-8"
$Charset.native$Name.MIME
[1] "UTF-8"
$Charset.native$ASCII.subset
[1] TRUE
$Charset.native$Unicode.1to1
[1] NA
$Charset.native$CharSize.8bit
[1] FALSE
$Charset.native$CharSize.min
[1] 1
$Charset.native$CharSize.max
[1] 3
$ICU.system
[1] FALSE
$ICU.UTF8
[1] FALSE
>
Do the last couple of lines indicate anything? |
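The `charToRaw()` and `utf8ToInt()` results above are consistent with the strings being stored as GBK bytes rather than UTF-8. A minimal base-R sketch of that check follows. It assumes the platform's `iconv()` recognizes the name "GBK" (true for glibc and GNU libiconv, but not guaranteed everywhere), and the variable names are illustrative only.

```r
# Bytes copied from the charToRaw() output in this thread.
x_bytes <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))  # "昌平区"
p_bytes <- rawToChar(as.raw(c(0xcf, 0xd8)))                          # "县"

# Neither byte sequence is well-formed UTF-8, which fits utf8ToInt() giving NA.
is_utf8 <- validUTF8(c(x_bytes, p_bytes))

# Re-encode to UTF-8, naming the source encoding explicitly (assumed: GBK).
x_utf8 <- iconv(x_bytes, from = "GBK", to = "UTF-8")
p_utf8 <- iconv(p_bytes, from = "GBK", to = "UTF-8")

print(is_utf8)
print(charToRaw(x_utf8))
print(charToRaw(p_utf8))
```

If the converted bytes come out as `e6 98 8c e5 b9 b3 e5 8c ba` and `e5 8e bf`, the values reported elsewhere in this thread once the encoding is declared, the input was GBK.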
Sorry for the confusion; the miscoding was my mistake. The problem remains, though. Try this:
library(dplyr)
library(rvest)
library(stringi)
#>
link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"
tx_xi <- read_html(link_speech) %>%
  html_nodes("p") %>%
  html_text()
tx_xi[6]
#> [1] "同志们,朋友们:"
stri_detect_regex(tx_xi[6], "同志们") #Note that these are the very first three characters of the speech
#> [1] FALSE
#> |
I think the problem is due to:
ICU thinks your native encoding is UTF-8, whereas it's probably GBK. Could you give |
My, it works! It looks like the error is indeed attributable to ICU's encoding detection. Once the |
Great, I changed the title of the issue so that it's more searchable. To sum up, the solution was:
|
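For reference, the call that appears later in this thread is `stri_enc_set("Windows-936")`. The following is a minimal sketch of using it, assuming a Windows session whose actual code page is 936 and an installed stringi package. `old_enc` and the other variable names are illustrative. The setting is session-wide, so the sketch restores the previous value afterwards.

```r
# GBK bytes for "昌平区" and "县", as reported by charToRaw() in this thread.
x_gbk <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))
p_gbk <- rawToChar(as.raw(c(0xcf, 0xd8)))

# Byte-wise matching needs no encoding knowledge: "县" is not in "昌平区".
byte_hit <- grepl(p_gbk, x_gbk, fixed = TRUE, useBytes = TRUE)
print(byte_hit)

# The stringi part runs only if the package is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  old_enc <- stringi::stri_enc_get()  # encoding stringi currently assumes
  # Declare that unmarked ("native") strings are GBK rather than UTF-8.
  suppressWarnings(suppressMessages(stringi::stri_enc_set("Windows-936")))
  print(stringi::stri_detect_fixed(x_gbk, p_gbk))
  # Restore the previous setting so the rest of the session is unaffected.
  suppressWarnings(suppressMessages(stringi::stri_enc_set(old_enc)))
}
```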
A quick follow-up question: is there any tradeoff in changing the stringi encoding? Or is there a way to let
# stri_enc_set() has not been called here
stri_detect_regex(stri_conv("昌平区", to = "UTF8"), stri_conv("县", to = "UTF8"))
#> [1] TRUE
# The correct outcome should be false, since the "县" isn't in "昌平区" |
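An alternative to changing the session-wide stringi setting is to name the source encoding on each conversion instead of relying on the detected native one. This is a hedged sketch, assuming the inputs really are GBK and that `iconv()` recognizes "GBK" on the platform. The stringi part runs only if the package is installed, and all variable and helper names are illustrative.

```r
# GBK bytes, built from the charToRaw() output in this thread.
x_gbk <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))  # "昌平区"
y_gbk <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xcf, 0xd8)))  # "昌平县"
p_gbk <- rawToChar(as.raw(c(0xcf, 0xd8)))                          # "县"

# Hypothetical helper: convert with an explicit source encoding.
to_utf8 <- function(s) iconv(s, from = "GBK", to = "UTF-8")

# After a correct conversion, only "昌平县" contains "县".
hits <- c(
  grepl(to_utf8(p_gbk), to_utf8(x_gbk), fixed = TRUE),
  grepl(to_utf8(p_gbk), to_utf8(y_gbk), fixed = TRUE)
)
print(hits)

# The same idea with stringi, passing `from` so the native guess is not used.
if (requireNamespace("stringi", quietly = TRUE)) {
  conv <- function(s) stringi::stri_conv(s, from = "GBK", to = "UTF-8")
  print(stringi::stri_detect_fixed(conv(c(x_gbk, y_gbk)), conv(p_gbk)))
}
```

The tradeoff is verbosity: every string entering the pipeline has to be converted explicitly, but nothing depends on what ICU believes the native encoding to be.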
I get Can you call:
Also, try |
Also, maybe the most recent R - UCRT is worth giving a try? https://github.com/r-windows/docs/blob/master/ucrt.md |
Regarding the UCRT build, it is definitely intriguing, but it looks like it is only about writing packages? I didn't see any instructions showing how I can have Windows automatically convert everything to UTF-8 at the input stage. If that isn't possible, UCRT won't be that different from manually converting to UTF-8 with
#> [1] ef bf bd ef bf bd c6 bd ef bf bd ef bf bd
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] ef bf bd ef bf bd
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"
stri_detect_regex(iconv("昌平区", to = "UTF8"), "县") # supposed to be FALSE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), "县") # supposed to be TRUE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), iconv("县", to = "UTF8")) # supposed to be TRUE
#> [1] TRUE
|
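One reading of the `ef bf bd` runs above: `ef bf bd` is the UTF-8 encoding of U+FFFD, the replacement character that converters emit for bytes they cannot decode. If GBK bytes are decoded as though they were UTF-8, both the text and the pattern collapse into replacement characters, which would explain a spurious match. A small base-R sketch using the byte values printed above (variable names are illustrative):

```r
# UTF-8 bytes of U+FFFD REPLACEMENT CHARACTER.
repl <- charToRaw(intToUtf8(0xFFFD))

# Bytes printed in this thread after the failed conversion of "昌平区" and "县".
text_bad <- rawToChar(as.raw(c(0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd, 0xc6, 0xbd,
                               0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd)))
pat_bad  <- rawToChar(as.raw(c(0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd)))

# Code points: replacement characters around U+01BD (the byte pair c6 bd,
# which happens to be valid UTF-8 on its own).
code_points <- utf8ToInt(text_bad)

# The mangled pattern is a prefix of the mangled text, so a match is reported.
false_hit <- grepl(pat_bad, text_bad, fixed = TRUE, useBytes = TRUE)

print(repl)
print(code_points)
print(false_hit)
```

Under this reading, any two-byte GBK character would "match" any text that starts with a mangled GBK character, regardless of which characters they originally were.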
Hmmm... are these really generated with The byte sequence |
Oh, I might have misled you! The above outputs were produced without setting the
library(stringi)
stri_enc_set("Windows-936")
#> New settings: stringi_1.7.3 (en_US.GBK; ICU4C 69.1 [bundle]; Unicode 13.0)
#> Warning message:
#> In stri_info(short = TRUE) :
#> Your native charset does not map to Unicode well. This may cause serious problems. Consider switching to UTF-8.
charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] e6 98 8c e5 b9 b3 e5 8c ba
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] e5 8e bf
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native" |
:) Dear all, has anyone working in this locale experienced similar issues? |
stri_detect_regex does not seem to recognize Chinese characters correctly when they are used as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:
Another example:
The issue was submitted to stringr (tidyverse/stringr#386 (comment)), but it looks like a stringi problem?