regex - Extract all "ul a" elements from an HTML page that have a certain string in the "title"
In the style of the example on this page, I'm trying to collect the senses in which a particular name is applied to a specific person, based on a Wikipedia disambiguation page.
The trouble is that Wikipedia pages are highly non-uniform.
One common feature, though, is that the list of names appears in a ul element, each name is part of a link (a), and the title= component of the link contains a reference to the name we're looking for, since these links are associated with Wikipedia pages.
Using jsoup, or some other method, how can I recognize these components?
h2:contains(people) + ul a
^ That works when they're in a section entitled "People", but as mentioned, that is not always the case.
Perhaps in pseudocode it would be something like this:
ul && title contains *string*
Or maybe this:
a[href], [title]
but matching only part of the title, not the whole thing.
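To make the pseudocode above concrete, here is a minimal sketch of what I have in mind. It assumes jsoup's attribute-substring selector `[attr*=value]` combined with a descendant selector on `ul`; the class name `TitleLinkFinder`, the method `titlesInLists`, and the sample HTML are made up for illustration, and the search string is assumed to be a single word with no selector metacharacters.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of jsoup: collects the title attribute of
// every link inside a <ul> whose title contains the given string.
class TitleLinkFinder {

    static List<String> titlesInLists(String html, String needle) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        // "ul a" restricts the match to links nested anywhere inside a list;
        // [title*=...] keeps only links whose title contains the needle.
        // Assumption: needle is a plain word, so it is safe to concatenate.
        for (Element link : doc.select("ul a[title*=" + needle + "]")) {
            titles.add(link.attr("title"));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample markup, loosely shaped like a disambiguation page.
        String html =
            "<h2>Politics</h2>"
          + "<ul>"
          + "<li><a href=\"/wiki/Jon_Corzine\" title=\"Jon Corzine\">Jon Corzine</a>, governor</li>"
          + "<li><a href=\"/wiki/Other\" title=\"Someone Else\">Other</a></li>"
          + "</ul>"
          + "<p><a href=\"/x\" title=\"Corzine outside list\">x</a></p>";
        for (String title : titlesInLists(html, "Corzine")) {
            System.out.println(title);
        }
    }
}
```

Because the selector does not depend on the preceding h2, this should not care whether the list sits under a "People" heading or not, which is the generality I'm after. I have not checked it against real disambiguation pages.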
This is an example of a non-structured page where such a method is called for.
This is an example of a structured one, where it's not important.
But I'm trying to make this generalizable, so that it applies equally to both types.
This kind of works:
Elements linx = docx.select("a:contains(corzine)");
for (Element linq : linx) {
    System.out.println(linq.text());
}
But maybe one among you might hit upon a better solution.