Regex for validating and sanitizing all English and non-English Unicode alphabet characters in PHP
While there have been many questions regarding regex issues with non-English characters, I have not been able to find a working answer. Moreover, there does not seem to be a simple PHP library that would help me filter non-English input.
Could you please suggest a regular expression that would allow
- all english alphabet characters (abc...)
- all non-english alphabet characters (šýüčá...)
- spaces
- case insensitive
in both validation and sanitization. Essentially, I want either preg_match to return false when the input contains anything other than the 4 points above, or preg_replace to get rid of everything except these 4 categories.
I was able to create '/^((\p{L}\p{M}*)|(\p{Cc})|(\p{Z}))+$/ui' from http://www.regular-expressions.info/unicode.html. This regular expression works when validating input, but not when sanitizing it.
EDIT:
A user enters 'český [jazyk]' as input. Using '/^[\p{L}\p{Zs}]+$/u' in preg_match, the script determines that the string contains unallowed characters (in this case '[' and ']'). Next, I would like to use preg_replace to delete those unwanted characters. What regular expression should I pass to preg_replace to match all characters that are not specified by the regular expression stated above?
I think all you need is a character class like:
^[\p{L}\p{Zs}]+$
It means: the whole string (or line, with the (?m) option) can only contain Unicode letters or spaces.
Have a look at the demo.
    // Multiline mode (m): each line of $str is validated separately.
    $re = "/^[\\p{L}\\p{Zs}]+$/um";
    $str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
    preg_match_all($re, $str, $matches);
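To see which inputs the validation pattern accepts, a small sketch like the following may help. It tests a few sample strings one by one with preg_match instead of relying on the m flag. The variable names ($lines, $results) are only illustrative. Note that preg_match returns 1 on a match, 0 on no match, and false only on error, so the result is compared against 1.

```php
<?php
// Illustrative sketch: validate each string on its own.
$re = '/^[\p{L}\p{Zs}]+$/u';

$lines = [
    'space s',
    'šýüčá šýüčá šýüčá ddd',
    'šýüčá eee 4',     // contains a digit, so it should be rejected
    'český [jazyk]',   // contains brackets, so it should be rejected
];

$results = [];
foreach ($lines as $line) {
    // 1 = match, 0 = no match, false = error (for example, invalid UTF-8)
    $results[$line] = preg_match($re, $line) === 1;
}

var_dump($results);
```

Only the first two strings consist purely of Unicode letters and spaces, so only they pass.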
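To remove all symbols that are not Unicode letters or spaces, use this code:
    // The negated class matches every run of characters that is neither a letter nor a space.
    $re = "/[^\\p{L}\\p{Zs}]+/u";
    $str = "český [jazyk]";
    echo preg_replace($re, "", $str);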
The output of the sample program:
český jazyk
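Validation and sanitization can also be wrapped together. The following is a minimal sketch, not a definitive implementation: the function names isLettersAndSpaces and stripNonLetters are hypothetical, and adding \p{M} to the class is an assumption taken from the pattern in the question, so that accents stored as separate combining marks are not stripped.

```php
<?php
// Hypothetical helpers, not part of PHP or any library.
// Assumption: \p{M} is allowed so that decomposed accents (letter + combining mark) survive.

function isLettersAndSpaces(string $input): bool
{
    // preg_match returns 1 on match, 0 on no match, false on error
    return preg_match('/^[\p{L}\p{M}\p{Zs}]+$/u', $input) === 1;
}

function stripNonLetters(string $input): string
{
    // preg_replace returns null on error (for example, invalid UTF-8 input)
    $clean = preg_replace('/[^\p{L}\p{M}\p{Zs}]+/u', '', $input);
    return $clean ?? '';
}

$input = 'český [jazyk]';
$valid = isLettersAndSpaces($input);   // false: the brackets are not allowed
$clean = stripNonLetters($input);      // brackets removed
echo $clean, "\n";
```

After sanitizing, passing $clean back through isLettersAndSpaces should succeed, which is a cheap way to confirm that both patterns stay in sync.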