Validation - checking the validity of a URL in Java so as not to crash on a 404 error
Essentially, I want my program to be a bulletproof tank: it should absorb 404 errors and keep on rolling, crushing the interwebs, leaving corpses dead and bloodied in its wake, or whatever.
I keep getting this error:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://en.wikipedia.org/wiki/hudson+township+%28disambiguation%29
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
    at q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:29)
    at q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
    at q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:38)
    at q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
    at q.Runner.main(Runner.java:35)
But I can't understand why, because I'm checking to see whether I have a valid URL before navigating to it. Is my checking procedure incorrect?
I tried examining other Stack Overflow questions on the subject, but they're not authoritative; I also implemented many of the solutions from this one and this one, and so far nothing has worked.
I'm using the Apache Commons UrlValidator. Here is the code I've been using recently:
// get its normal wiki disambig page
String url_check = "https://en.wikipedia.org/wiki/" + associated_alias;
UrlValidator urlValidator = new UrlValidator();
if (urlValidator.isValid(url_check)) {
    Document docx = Jsoup.connect(url_check).get();
    // this can handle the less structured ones.
and
// check the validity of the URL
String url_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=search";
UrlValidator urlValidator = new UrlValidator();
if (urlValidator.isValid(url_czech)) {
    URL wikidata_page = new URL(url_czech);
    URLConnection wiki_connection = wikidata_page.openConnection();
    BufferedReader wiki_data_pagecontent = new BufferedReader(
            new InputStreamReader(wiki_connection.getInputStream()));
The URLConnection throws an error when the status code of the webpage you are downloading is anything other than 2xx (such as 200, 201, etc.). Note that UrlValidator only checks that the URL is syntactically well-formed; it says nothing about whether the server will actually return the page, which is why your check passes and the fetch still fails with a 404. Instead of passing Jsoup a URL or String to parse the document, consider passing it an input stream containing the webpage's data.
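If the goal is simply to skip dead pages rather than parse their error content, another option is to ask the server for the status code up front. Here is a minimal sketch of that idea, assuming a HEAD request is acceptable for your pages; isReachable is a hypothetical helper name, not part of the code above:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: true only when the server answers with a 2xx status,
// so the caller can skip 404 pages instead of crashing on them.
public static boolean isReachable(String urlString) {
    try {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(urlString).openConnection();
        connection.setRequestMethod("HEAD"); // we only need the status line, not the body
        int status = connection.getResponseCode(); // a 404 is returned here, not thrown
        return status >= 200 && status < 300;
    } catch (IOException e) {
        return false; // lower-level failures: DNS, refused connection, timeout
    }
}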
Using the HttpURLConnection class, you can try to download the webpage with getInputStream(), placed in a try/catch block; if that fails, attempt the download via getErrorStream() instead.
Consider this bit of code, which downloads the wiki page even if it returns a 404:
String url_czech = "https://en.wikipedia.org/wiki/hudson+township+%28disambiguation%29";
URL wikidata_page = new URL(url_czech);
HttpURLConnection wiki_connection = (HttpURLConnection) wikidata_page.openConnection();
InputStream wikiInputStream = null;

try {
    // Try to connect and use the input stream
    wiki_connection.connect();
    wikiInputStream = wiki_connection.getInputStream();
} catch (IOException e) {
    // The connection failed, try using the error stream instead
    wikiInputStream = wiki_connection.getErrorStream();
}

// Parse the input stream using Jsoup
Jsoup.parse(wikiInputStream, null, wikidata_page.getProtocol() + "://" + wikidata_page.getHost() + "/");
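If you would rather stay entirely within jsoup, it can also be told not to throw on non-2xx responses via Connection.ignoreHttpErrors(true), letting you inspect the status code yourself. A minimal sketch of that approach (not part of the code above):

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String url_czech = "https://en.wikipedia.org/wiki/hudson+township+%28disambiguation%29";

try {
    Connection.Response response = Jsoup.connect(url_czech)
            .ignoreHttpErrors(true) // do not throw HttpStatusException on a 404
            .execute();

    if (response.statusCode() == 200) {
        Document doc = response.parse(); // parse only when the fetch succeeded
    }
    // otherwise: absorb the 404 and keep on rolling
} catch (IOException e) {
    // lower-level failures (DNS, timeouts) still surface here
}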