validation - Check for validity of URL in Java, so as not to crash on 404 error


Essentially, I want my program to be a bulletproof tank: absorb 404 errors and keep on rolling, crushing the interwebs and leaving corpses dead and bloodied in its wake, or, w/e.

I keep getting this error:

    Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
        at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
        at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:29)
        at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
        at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:38)
        at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
        at Q.Runner.main(Runner.java:35)

But I can't understand why, because I am checking to see if I have a valid URL before I navigate to it. Is my checking procedure incorrect?

I tried to examine other Stack Overflow questions on this subject, but they're not authoritative; plus I implemented many of the solutions from this one and this one, and so far nothing has worked.

I'm using the Apache Commons UrlValidator. This is the code I've been using recently:

    //get its normal wiki disambig page
    String url_check = "https://en.wikipedia.org/wiki/" + associated_alias;

    UrlValidator urlValidator = new UrlValidator();

    if ( urlValidator.isValid( url_check ) )
    {
        Document docx = Jsoup.connect( url_check ).get();
        //this can handle the less structured ones.

and

    //check validity of the URL
    String url_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=search";

    UrlValidator urlValidator = new UrlValidator();

    if ( urlValidator.isValid( url_czech ) )
    {
        URL wikidata_page = new URL( url_czech );
        URLConnection wiki_connection = wikidata_page.openConnection();
        BufferedReader wiki_data_pagecontent = new BufferedReader(
                                                   new InputStreamReader(
                                                        wiki_connection.getInputStream()));
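
Worth noting for the "keep on rolling" goal: jsoup reports non-2xx responses by throwing org.jsoup.HttpStatusException, a subclass of IOException that exposes the status code, so one way to absorb a 404 is simply to catch it around the fetch. A minimal sketch of that pattern (the fetchOrNull helper is hypothetical, not from the question's code):

    import org.jsoup.HttpStatusException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;

    public class SafeFetch {
        // Returns the parsed page, or null if the server answered with an HTTP error.
        static Document fetchOrNull( String url ) {
            try {
                return Jsoup.connect( url ).get();
            } catch ( HttpStatusException e ) {
                // e.g. status=404: log it and keep rolling instead of crashing
                System.err.println( "Skipping " + url + " (HTTP " + e.getStatusCode() + ")" );
                return null;
            } catch ( IOException e ) {
                // network trouble, timeouts, etc.
                System.err.println( "Skipping " + url + ": " + e.getMessage() );
                return null;
            }
        }
    }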

The URLConnection throws the error when the status code of the webpage you are downloading is anything other than 2xx (such as 200 or 201, etc.). The UrlValidator check cannot prevent this, because it only verifies that the URL string is syntactically well-formed; it never contacts the server. Instead of passing Jsoup a URL or a String to parse your document, consider passing it an input stream containing the webpage data.
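
To see why the validity check passes while the fetch still fails, consider this quick sketch (the page title is made up): UrlValidator accepts any well-formed URL, including one whose page does not exist:

    import org.apache.commons.validator.routines.UrlValidator;

    public class ValidatorDemo {
        public static void main( String[] args ) {
            UrlValidator urlValidator = new UrlValidator();
            // The string is syntactically a valid URL, so isValid() returns true,
            // even though fetching this page would yield a 404.
            System.out.println( urlValidator.isValid(
                "https://en.wikipedia.org/wiki/Some+Page+That+Does+Not+Exist" ) );  // true
        }
    }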

Using the HttpURLConnection class, you can try to download the webpage using getInputStream(), placed in a try/catch block; if that fails, attempt the download via getErrorStream() instead.

Consider this bit of code, which downloads the wiki page even if it returns a 404:

    String url_czech = "https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29";

    URL wikidata_page = new URL( url_czech );
    HttpURLConnection wiki_connection = (HttpURLConnection) wikidata_page.openConnection();
    InputStream wikiInputStream = null;

    try {
        // Try to connect and use the input stream
        wiki_connection.connect();
        wikiInputStream = wiki_connection.getInputStream();
    } catch ( IOException e ) {
        // The connection failed; try the error stream instead
        wikiInputStream = wiki_connection.getErrorStream();
    }

    // Parse the input stream using Jsoup
    Jsoup.parse( wikiInputStream, null, wikidata_page.getProtocol() + "://" + wikidata_page.getHost() + "/" );
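
If you would rather stay inside jsoup entirely, its Connection API also has an ignoreHttpErrors switch: with it enabled, execute() returns the response even on a 404 instead of throwing, and you can inspect the status code yourself. A short sketch of that alternative:

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;

    public class IgnoreErrorsDemo {
        public static void main( String[] args ) throws IOException {
            String url = "https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29";

            // With ignoreHttpErrors(true) a 404 no longer throws HttpStatusException;
            // execute() hands back the response so the status can be checked manually.
            Connection.Response response = Jsoup.connect( url )
                                                .ignoreHttpErrors( true )
                                                .execute();

            System.out.println( "Status: " + response.statusCode() );
            Document doc = response.parse();  // parses whatever body came back, 404 page included
            System.out.println( doc.title() );
        }
    }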
