regex - Extract all "ul a" elements from an HTML page that have a certain string in the "title" attribute


In the style of the example on this page, I'm trying to get a list of the senses in which a particular name can be applied to a specific person, based on the Wikipedia disambiguation page.

The trouble is that Wikipedia pages are highly non-uniform.

One common feature, though, is that the list of names appears in a ul element, each name as part of a link (a), and the title= attribute of the link contains a reference to the name we're looking for, since these links point to the associated Wikipedia pages.

Using jsoup, or some other method, how can I recognize these components?

h2:contains(people) + ul a

^ That works when they're in a section entitled "People", as mentioned, but that is not always the case.

Perhaps, in pseudocode, something like this:

ul && title contains *string*

Or maybe something like this:

a[href], [title] 

but matching only part of the title, not the whole thing.


This is an example of a non-structured page where such a method is called for.

This is an example of one where it's not as important.

But I'm trying to make it generalizable, so that it applies equally to both types.

This kind of works:

        // Selects every link whose visible text contains "corzine"
        Elements linx = docx.select("a:contains(corzine)");
        for (Element linq : linx)
        {
            System.out.println(linq.text());
        }

But maybe one of you might hit upon a better solution.
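
For what it's worth, the pseudocode above (ul && title contains *string*) can probably be written with jsoup's attribute-substring selector, [attr*=value], combined with a descendant selector. This is only a rough sketch: the class name TitleLinkFinder, the method linksWithTitle and the sample HTML are made up for illustration, and it assumes the search string contains no double quotes or closing brackets.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TitleLinkFinder {

    // Hypothetical helper: returns the text of every <a> nested anywhere
    // inside a <ul> whose title attribute contains `needle` as a substring.
    static List<String> linksWithTitle(Document doc, String needle) {
        // "ul a" is a descendant selector; [title*="..."] matches part of the
        // attribute value. Assumes `needle` has no double quotes or ']'.
        String query = "ul a[title*=\"" + needle + "\"]";
        List<String> result = new ArrayList<>();
        for (Element link : doc.select(query)) {
            result.add(link.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for a disambiguation page.
        String html = "<h2>Politicians</h2>"
                + "<ul>"
                + "<li><a href=\"/wiki/Jon_Corzine\" title=\"Jon Corzine\">Jon Corzine</a>, governor</li>"
                + "<li><a href=\"/wiki/Other\" title=\"Someone Else\">Someone Else</a></li>"
                + "</ul>"
                + "<p><a href=\"/wiki/X\" title=\"Corzine outside a list\">outside</a></p>";
        Document doc = Jsoup.parse(html);
        for (String text : linksWithTitle(doc, "Corzine")) {
            System.out.println(text);
        }
    }
}
```

Because the selector does not mention any heading, it should not matter whether the list sits under a "People" section or not. Links outside a ul are skipped, which may or may not be what is wanted on the less structured pages.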

