Selecting Rows With Chinese Characters in MySQL
…or Russian/Cyrillic characters, or Arabic, or any characters of non-latin origin.
On this blog, I’ve let tons of spam comments pile up in the pending queue. Their number grew to the point of being completely unmanageable. Finally I decided to clean it up, and I had to find a way to bulk delete rows from the wp_comments table in the WordPress database. One thing I noticed was that hundreds, probably thousands, of spam comments contained Chinese characters. Other comments contained Russian or other non-latin characters. But how to select comments in which non-latin characters appear without selecting for specific strings (for example by using LIKE or REGEXP)?
It turns out the way to do it is like this:
SELECT * FROM wp_comments WHERE comment_content != CONVERT(comment_content USING latin1)
CONVERT … USING will convert the column value to the specified character set, replacing any characters that do not map into the set with a question mark. If there are no Chinese/Russian/etc. characters, the output of CONVERT … USING will be the same as the unconverted column value.
This was the SQL statement I used to delete the unwanted posts:
DELETE FROM wp_comments WHERE comment_content != CONVERT(comment_content USING latin1) AND comment_approved = '0'
If there is any interest, I can post some other SQL statements I used to clean up the spam.