regex - Extract loosely structured Wikipedia text. HTML -


Some of the HTML on Wikipedia disambiguation pages is, shall we say, ambiguous, i.e. the links there that connect to specific persons named Corzine are difficult to capture using jsoup because they're not explicitly structured, nor do they live in a particular section, as in this example. See the Corzine page here.

How can I get hold of them? Is jsoup even a suitable tool for this task?

Perhaps I should use a regex, but I fear doing that because I want the solution to be generalizable.

</b> may refer to:</p>   <ul>    <li><a href 

^This part is standard, so maybe I could use a regex to match that?

    <p><b>corzine</b> may refer to:</p>
    <ul>
      <li><a href="/wiki/dave_corzine" title="dave corzine">dave corzine</a> (born 1956), basketball player</li>
      <li><a href="/wiki/jon_corzine" title="jon corzine">jon corzine</a> (born 1947), former ceo of <a href="/wiki/mf_global" title="mf global">mf global</a>, former governor on new jersey, former ceo of <a href="/wiki/goldman_sachs" title="goldman sachs">goldman sachs</a></li>
    </ul>
    <table id="setindexbox" class="metadata plainlinks dmbox dmbox-setindex" style="" role="presentation">

The ideal output:

dave corzine jon corzine 

Maybe it's possible to match the section between </b> may refer to:</p> and <table id="setindexbox", and extract everything that's in between. I guess <table id="setindexbox" can be matched easily enough in jsoup, but </b> may refer to:</p> should be more difficult because the <b> or <p> is not distinguished by any attribute.


I tried this:

    Elements table = docx.select("ul");
    Elements links = table.select("li");

    Pattern ppp = Pattern.compile("table id=\"setindexbox\" ");
    Matcher mmm = ppp.matcher(inputline);

    Pattern pp = Pattern.compile("</b> may refer to:</p>");
    Matcher mm = pp.matcher(inputline);
    if (mm.matches()) {
        while (!mmm.matches())
            for (Element link : links) {
                String url = link.attr("href");
                String text = link.text();
                System.out.println(text + ", " + url);
            }
    }

But it didn't work.

This selector works:

    Elements els = doc.select("p ~ ul a:eq(0)");

See: http://try.jsoup.org/~ypvgr0pxva3owqsjte4rfm-ls2y

That's looking for each a element that is the first child of its parent (a:eq(0)) inside a ul that's a following sibling of a p. Use p:contains(corzine) ~ ul a:eq(0) if there are other conflicts.

Or perhaps more generally: :contains(may refer to) ~ ul a:eq(0)
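As a rough illustration of how the more general selector above might be used end to end, here is a minimal sketch. It assumes jsoup is on the classpath; the class name DisambiguationLinks, the helper extractNames, and the inline HTML sample are hypothetical and only loosely modelled on the snippet in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DisambiguationLinks {

    // Hypothetical helper (not from the original post): collects the text of
    // every <a> that is the first child element of its parent, inside a <ul>
    // that is a following sibling of an element containing "may refer to".
    static List<String> extractNames(String html) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element link : doc.select(":contains(may refer to) ~ ul a:eq(0)")) {
            names.add(link.text());
        }
        return names;
    }

    public static void main(String[] args) {
        // Illustrative sample only; a real page would be fetched or read from disk.
        String html = "<p><b>Corzine</b> may refer to:</p>"
            + "<ul>"
            + "<li><a href=\"/wiki/Dave_Corzine\">Dave Corzine</a> (born 1956), basketball player</li>"
            + "<li><a href=\"/wiki/Jon_Corzine\">Jon Corzine</a> (born 1947), former CEO of "
            + "<a href=\"/wiki/MF_Global\">MF Global</a></li>"
            + "</ul>"
            + "<table id=\"setindexbox\"></table>";
        System.out.println(extractNames(html));
    }
}
```

Because a:eq(0) only matches a link that is the first child element of its list item, secondary links such as the MF Global one are skipped. A list item that opens with some other element before its link would be missed, so the selector may need adjusting per page.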

It's hard to generalize across Wikipedia because it's unstructured. IMHO it's easier to use a parser and CSS selectors than regexes, particularly over time, when templates change, etc.

