How to write a regex to find HTML tags outside CDATA tags in an XML document


I'm trying to import an ONIX (XML) file, and I'm getting import errors due to HTML tags in the descriptive text. In this particular file, some of the descriptive text is enclosed in CDATA tags, but it appears that some isn't.

How can I write a regex to find HTML tags that aren't enclosed in CDATA tags?

I'm using a VB.NET app to import the data into a SQL Server database, but at this point I'm just trying to write a regex in Notepad++ to see what's possible. I can incorporate the regex into the VB code later.

Here is an example of XML that will import properly:

<othertext>
  <texttypecode>01</texttypecode>
  <textformat>02</textformat>
  <text><![CDATA[more series of chapters on theology of john's gospel, <em>jesus christ</em> relates each of john's teachings declared aim, expressed in john 20: 30-31: "jesus did many other signs before disciples, have not been written in book; these have been written may believe jesus christ, son of god, , believing may have life in name." indeed, each chapter in morris's book takes facet or aspect of john's expressed aim.<br/><br/>for age still asking question "who jesus?" leon morris argues convincingly john's entire gospel written show human jesus christ, or messiah, son of god. morris's firm conviction john's purpose evangelical theological -- is, john wrote book readers might believe in christ , result have eternal life.]]></text>
</othertext>

And here is XML that won't import properly:

<othertext>
  <texttypecode>01</texttypecode>
  <textformat>02</textformat>
  <text>more series of chapters on theology of john's gospel, <em>jesus christ</em> relates each of john's teachings declared aim, expressed in john 20: 30-31: "jesus did many other signs before disciples, have not been written in book; these have been written may believe jesus christ, son of god, , believing may have life in name." indeed, each chapter in morris's book takes facet or aspect of john's expressed aim.<br/><br/>for age still asking question "who jesus?" leon morris argues convincingly john's entire gospel written show human jesus christ, or messiah, son of god. morris's firm conviction john's purpose evangelical theological -- is, john wrote book readers might believe in christ , result have eternal life.</text>
</othertext>

Now,

<textformat>02</textformat>

indicates that the contents of the tag are HTML, which I can handle OK. The problem comes in when I have tags that aren't labelled appropriately. I need to find them so I can correct them.

This regex can get you somewhere:

<\w+>(?!<!\[CDATA\[)

I ran it on the examples you provided in Sublime Text, and it matched the tags (HTML ones included) that aren't immediately followed by the CDATA stuff.
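To see what that pattern catches and what it misses, here is a rough sketch in Python (not the VB.NET from the question, just something quick to run). The first pattern is the lookahead regex with the brackets escaped. Everything else is an assumption for illustration: the names `CDATA_SECTION`, `TEXT_ELEMENT`, `ANY_TAG` and `unwrapped_tags` are hypothetical, the two sample strings are shortened stand-ins for the examples in the question, and the helper assumes a CDATA section never contains a literal `</text>`.

```python
import re

# Lookahead pattern: an opening tag that is NOT immediately followed
# by the start of a CDATA section.
TAG_NOT_BEFORE_CDATA = re.compile(r"<\w+>(?!<!\[CDATA\[)")

# Hypothetical helper patterns, assumed for this sketch only.
CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)
TEXT_ELEMENT = re.compile(r"<text>(.*?)</text>", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"</?\w+[^>]*>")


def unwrapped_tags(xml):
    """Return tags found inside <text> elements but outside any CDATA section."""
    found = []
    for match in TEXT_ELEMENT.finditer(xml):
        # Remove CDATA sections first so markup inside them is ignored.
        outside_cdata = CDATA_SECTION.sub("", match.group(1))
        found.extend(ANY_TAG.findall(outside_cdata))
    return found


# Shortened stand-ins for the two examples in the question.
good = "<text><![CDATA[A <em>fine</em> book.<br/>]]></text>"
bad = "<text>A <em>fine</em> book.<br/></text>"

print(TAG_NOT_BEFORE_CDATA.findall(good))  # ['<em>']
print(TAG_NOT_BEFORE_CDATA.findall(bad))   # ['<text>', '<em>']
print(unwrapped_tags(good))                # []
print(unwrapped_tags(bad))                 # ['<em>', '</em>', '<br/>']
```

The lookahead on its own skips a `<text>` tag that opens straight into CDATA, but it still reports `<em>` inside the CDATA section and ignores closing or self-closing tags such as `<br/>`. Stripping the CDATA sections before searching is one possible way to narrow the hits down to markup that really is unwrapped.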

