Regex for validating and sanitizing all English and non-English Unicode alphabet characters in PHP


While there have been many questions regarding the non-English characters regex issue, I have not been able to find a working answer. Moreover, there does not seem to be any simple PHP library that would help me filter non-English input.

Could you please suggest a regular expression that would allow

  1. all English alphabet characters (abc...)
  2. all non-English alphabet characters (šýüčá...)
  3. spaces
  4. case insensitive

in validation as well as sanitization. Essentially, I want either preg_match to return false when the input contains anything other than the 4 points above, or preg_replace to get rid of everything except these 4 categories.

I was able to create '/^((\p{L}\p{M}*)|(\p{Cc})|(\p{Z}))+$/ui' from http://www.regular-expressions.info/unicode.html. This regular expression works when validating input, but not when sanitizing it.

Edit:

The user enters 'český [jazyk]' as the input. Using '/^[\p{L}\p{Zs}]+$/u' in preg_match, the script determines that the string contains unallowed characters (in this case '[' and ']'). Next I would like to use preg_replace to delete those unwanted characters. What regular expression should I pass to preg_replace to match all characters that are not specified by the regular expression stated above?

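I think all you need is a character class like:

^[\p{L}\p{Zs}]+$

It means: the whole string (or line, with the (?m) option) can only contain Unicode letters or spaces. Note that the property names are written with an uppercase first letter (\p{L}, \p{Zs}); lowercase spellings such as \p{l} may be rejected as unknown properties, depending on the PCRE version.

Have a look at the demo.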

// m makes ^ and $ match at every line, u enables UTF-8 mode
$re = "/^[\\p{L}\\p{Zs}]+$/um";
$str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
preg_match_all($re, $str, $matches);
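For validating a single input value rather than a multi-line sample, something along these lines might do. It is only a sketch: the function name isLettersAndSpaces is made up for illustration, and it assumes PHP 7 or later. It uses \z instead of $ because $ also matches just before a trailing newline, and it drops the i flag because \p{L} already covers both upper and lower case.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post):
// returns true only if $input consists solely of Unicode letters and space separators.
function isLettersAndSpaces(string $input): bool
{
    // \p{L}  = any Unicode letter (both cases), \p{Zs} = any space separator.
    // /u     = treat pattern and subject as UTF-8.
    // \z     = very end of the string (unlike $, it does not allow a trailing newline).
    // preg_match returns 1 on a match, 0 on no match and false on error
    // (for example invalid UTF-8), so compare strictly against 1.
    return preg_match('/^[\p{L}\p{Zs}]+\z/u', $input) === 1;
}

var_dump(isLettersAndSpaces('český jazyk'));   // bool(true)
var_dump(isLettersAndSpaces('český [jazyk]')); // bool(false)
```

Comparing against 1 strictly means that an input which is not valid UTF-8 is treated as invalid instead of being mistaken for "no match".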

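To remove all symbols that are not Unicode letters or spaces, use this code:

// The negated class [^...] matches every character that is NOT a letter or a space separator
$re = "/[^\\p{L}\\p{Zs}]+/u";
$str = "český [jazyk]";
echo preg_replace($re, "", $str);

The output of the sample program: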

český jazyk 
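If the input may contain letters written as a base letter followed by a combining accent (for example "c" plus U+030C instead of the single character "č"), the negated class above would strip the accent. A possible variation, sketched under that assumption, keeps \p{M} as well. The function name keepLettersAndSpaces is hypothetical, and the code assumes PHP 7 or later for the "\u{...}" escape and the ?? operator.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post):
// removes everything except Unicode letters, combining marks and space separators.
function keepLettersAndSpaces(string $input): string
{
    // \p{M} is allowed so that "base letter + combining accent" keeps its accent.
    $clean = preg_replace('/[^\p{L}\p{M}\p{Zs}]+/u', '', $input);

    // preg_replace returns null on error, e.g. when $input is not valid UTF-8;
    // this sketch falls back to an empty string in that case.
    return $clean ?? '';
}

echo keepLettersAndSpaces('český [jazyk]'), "\n";   // český jazyk
echo keepLettersAndSpaces("c\u{030C}esky 123!"), "\n";
```

Whether combining marks should be accepted is a design choice; leaving \p{M} out gives exactly the behaviour of the regular expression in the answer.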
