How to parse HTML list values using Java

Question

I want to parse two values from an HTML file.

The HTML file contains several list elements, and I want to parse two values from each one:

a. The route number: 1, 100, 101
b. The route description: Swargate to Shivajinagar Circle route, Manpa bhavan to.., Kothrud depot to...

I used the code below to parse it, but I am not getting the required values; I only get the href value.

Please suggest a solution for this problem.

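    // Imports needed by this snippet:
    // java.io.Reader, java.io.StringReader, java.util.ArrayList, java.util.List,
    // javax.swing.text.MutableAttributeSet, javax.swing.text.html.HTML,
    // javax.swing.text.html.HTMLEditorKit, javax.swing.text.html.parser.ParserDelegator

    String html =
        "<li/><a href=r361.html>1</a> Swargate to Shivajinagar Circle route" +
        " <li/><a href=r511.html>100</a> Manpa bhavan to Hinjewadi phase 3" +
        "<li/><a href=r572.html>101</a> Kothrud depot to Kondhava Bu";

    Reader reader = new StringReader(html);
    HTMLEditorKit.Parser parser = new ParserDelegator();
    final List<String> links = new ArrayList<String>();

    parser.parse(reader, new HTMLEditorKit.ParserCallback() {
        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            // Only the href attribute of each <a> tag is collected here,
            // so the link text and the text after the link are never seen.
            if (t == HTML.Tag.A) {
                Object link = a.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    links.add(String.valueOf(link));
                }
            }
        }
    }, true);

    reader.close();
    System.out.println(links);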

UPDATE:

Now I am getting the value of the a tag using the code below (using the jsoup library):

    AssetManager assetManager = getAssets();
    InputStream ims = assetManager.open("index.html");
    Document doc = Jsoup.parse(ims, "UTF-8", "btc.com");
    Elements busNum = doc.getElementsByTag("a");
    pTagString = busNum.html();

    Log.i("hh", " onPostExecute =" + pTagString);

Now I want to get the value outside the a tag, e.g. "Swargate to Shivajinagar Circle route".

Does anybody know a method for this, or have any ideas?

That does not look like valid HTML. Are you sure this is the input you need to parse?
– Wim Deblauwe, Commented Oct 9, 2012 at 13:30
I just took the code from the HTML that I want to parse. Right now I only want to check whether I can parse the required values. When I parse the above code I get a result like this: [r361.html, r511.html, r572.html]
– Kris, Commented Oct 9, 2012 at 13:36
You want the a tag content, not the href attribute, for starters. Also, @WimDeblauwe, I'm not positive, but I think that's valid HTML assuming it's in a ul tag, albeit highly discouraged.
– Neil, Commented Oct 9, 2012 at 13:39
Try overriding some of the other methods to see which callbacks are being called. For example: public void handleText(char[] data, int pos)
– Wim Deblauwe, Commented Oct 9, 2012 at 13:41
Exactly correct. If you have any idea how to do it, please share it with me.
– Kris, Commented Oct 9, 2012 at 13:41
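
The handleText suggestion from the comments could be sketched roughly as follows. This is only an illustration under assumptions: the class name RouteCallbackSketch and the method extractRoutes are hypothetical, and it assumes the Swing parser reports the link text and the text after each link through handleText, which is worth verifying against the real file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is illustrative, not from the question.
class RouteCallbackSketch {

    /** Returns one {link text, text after the link} pair per <a> tag. */
    static List<String[]> extractRoutes(String html) {
        final List<String[]> routes = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean insideAnchor = false; // true between <a> and </a>
            private String[] current = null;      // pair being filled in

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    insideAnchor = true;
                    current = new String[] { "", "" };
                    routes.add(current);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.A) {
                    insideAnchor = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current == null) {
                    return; // text before the first link is ignored
                }
                String text = new String(data).trim();
                if (insideAnchor) {
                    current[0] += text; // route number inside the link
                } else {
                    current[1] += text; // description after the link
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return routes;
    }

    public static void main(String[] args) {
        String html =
              "<li/><a href=r361.html>1</a> Swargate to Shivajinagar Circle route"
            + " <li/><a href=r511.html>100</a> Manpa bhavan to Hinjewadi phase 3"
            + "<li/><a href=r572.html>101</a> Kothrud depot to Kondhava Bu";
        for (String[] route : extractRoutes(html)) {
            System.out.println(route[0] + " -> " + route[1]);
        }
    }
}
```

The sketch keeps a flag for "inside an a tag" so that the same handleText callback can tell the route number apart from the description that follows it.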

Chris · Accepted Answer · 2012-10-09 14:14:24Z

You don't even need to use a parser for this. You could use a regular expression.

See this tutorial about regex in Java.

Then you'll need something like this:

<a[^>]*>([^<]*)<[^>]*>(.*)

as your regular expression. The first capture group holds the link text and the second holds the rest of the line after the closing tag, so you get both values you need. For a small, fixed format like this it can also be faster than parsing the HTML.
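
A minimal sketch of how a pattern like this might be applied with java.util.regex. The pattern here is a variation, not the exact one above: the closing tag is matched literally as </a> and the trailing (.*) is replaced by ([^<]*) so that a match stops at the next tag even when several list items share one line, as in the question's sample string. The class name RouteRegexSketch and the method extractRoutes are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative, not from the answer.
class RouteRegexSketch {

    // Group 1: text inside <a ...>...</a>.
    // Group 2: text after </a>, up to the next tag or the end of the input.
    private static final Pattern ROUTE =
            Pattern.compile("<a[^>]*>([^<]*)</a>([^<]*)");

    /** Returns one {route number, description} pair per link found. */
    static List<String[]> extractRoutes(String html) {
        List<String[]> routes = new ArrayList<>();
        Matcher m = ROUTE.matcher(html);
        while (m.find()) {
            routes.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return routes;
    }

    public static void main(String[] args) {
        String html =
              "<li/><a href=r361.html>1</a> Swargate to Shivajinagar Circle route"
            + " <li/><a href=r511.html>100</a> Manpa bhavan to Hinjewadi phase 3"
            + "<li/><a href=r572.html>101</a> Kothrud depot to Kondhava Bu";
        for (String[] route : extractRoutes(html)) {
            System.out.println(route[0] + " -> " + route[1]);
        }
    }
}
```

Run against the sample string from the question, this prints "1 -> Swargate to Shivajinagar Circle route", "100 -> Manpa bhavan to Hinjewadi phase 3" and "101 -> Kothrud depot to Kondhava Bu". A regex like this only suits markup as regular as the sample; nested tags inside the link or the description would need a real parser.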
