A simple html tidy

linux
Written by Joe   
Friday, 27 June 2008 01:34

A simple HTML tidy program built on a Java Stack. It recognizes malformed tags (Error), tags that open with '<' but never close with '>' (OnlyStart), and start or end tags that have no matching partner. A complete HTML tidy would also need a dictionary of HTML tags and their attributes.
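As a rough idea of what such a dictionary could look like, here is a minimal sketch. The class name, the method names, and the tiny tag list are all assumptions made for illustration; a real tidy would need the full set of HTML tags and attributes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the program in this post.
class TagDictionarySketch {

    // Tag name -> attribute names accepted for that tag (deliberately tiny).
    private static final Map<String, Set<String>> ALLOWED = new HashMap<>();

    // Tags that never take an end tag.
    private static final Set<String> VOID = new HashSet<>(Arrays.asList("br", "hr", "img"));

    static {
        ALLOWED.put("a", set("href", "title"));
        ALLOWED.put("img", set("src", "alt"));
        ALLOWED.put("p", set());
        ALLOWED.put("br", set());
        ALLOWED.put("hr", set());
    }

    private static Set<String> set(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    static boolean isKnownTag(String name) {
        return ALLOWED.containsKey(name.toLowerCase());
    }

    static boolean isVoid(String name) {
        return VOID.contains(name.toLowerCase());
    }

    static boolean isAllowedAttribute(String tag, String attribute) {
        Set<String> attributes = ALLOWED.get(tag.toLowerCase());
        return attributes != null && attributes.contains(attribute.toLowerCase());
    }
}
```

With a lookup like this, the tidy could skip void tags such as br when matching start and end tags, and could flag unknown tags or attributes instead of only unmatched ones.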

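import java.util.Iterator;

public class TagIterator implements Iterator<TagIterator.Tag> {

    static class Tag {
        enum Type {
            Complete, Start, END, Comments, Error, OnlyStart
        }

        Type type;

        // Full text of the tag, from '<' to '>'
        String all;

        String name = "";

        // Index of '<' in the source
        int start;

        // Index of '>' in the source, or -1 if the tag is never closed
        int end;

        public String toString() {
            String t = null;
            if (type == Type.Complete) {
                t = "complete";
            } else if (type == Type.Start) {
                t = "Start";
            } else if (type == Type.END) {
                t = "end";
            } else if (type == Type.Comments) {
                t = "Comments";
            } else if (type == Type.Error) {
                t = "Error";
            } else if (type == Type.OnlyStart) {
                t = "OnlyStart";
            }
            return "Tag[type=" + t + ";name=" + name + ";start=" + start
                    + ";end=" + end + ";all=" + all + "]";
        }
    }

    private String src = null;

    private int index = 0;

    private int length = 0;

    private Tag tag;

    private static char LEFT = '<';

    private static char RIGHT = '>';

    private static char END = '/';

    private static String COMMENTS = "<!--";

    // NOTE: the published listing is cut off between the COMMENTS constant and
    // the "end > next" check. The constructor and the first lines of hasNext()
    // are a reconstruction (an assumption based on how the fields are used
    // below), not the original text.
    public TagIterator(String src) {
        this.src = src;
        this.length = src.length();
    }

    // Advances to the next tag, so call it exactly once before each next().
    public boolean hasNext() {
        if (index >= length) {
            return false;
        }
        int start = src.indexOf(LEFT, index);
        if (start == -1) {
            return false;
        }
        int end = src.indexOf(RIGHT, start);
        int next = src.indexOf(LEFT, start + 1);
        tag = new Tag();
        tag.start = start;
        tag.end = end;
        // A second '<' before the closing '>' (e.g. "<a <b>") marks a malformed tag.
        boolean malformed = end > next && next != -1;
        if (end != -1) {
            String tagInfo = src.substring(start, end + 1).trim();
            tag.all = tagInfo;
            if (tagInfo.startsWith(COMMENTS)) {
                tag.type = Tag.Type.Comments;
            } else if (malformed) {
                System.err.println("Error!" + src.substring(start, next + 1));
                tag.type = Tag.Type.Error;
            } else if (tagInfo.charAt(1) == END) {
                // End tag: "</name>"
                tag.type = Tag.Type.END;
                tag.name = tagInfo.substring(2, tagInfo.length() - 1);
            } else if (tagInfo.charAt(tagInfo.length() - 2) == END) {
                // Self-closed tag: "<name ... />"; the name starts right after '<'
                tag.type = Tag.Type.Complete;
                int i = tagInfo.indexOf(' ');
                if (i != -1) {
                    tag.name = tagInfo.substring(1, i);
                } else {
                    tag.name = tagInfo.substring(1, tagInfo.length() - 2);
                }
            } else {
                // Start tag: "<name ...>"
                tag.type = Tag.Type.Start;
                int i = tagInfo.indexOf(' ');
                if (i != -1) {
                    tag.name = tagInfo.substring(1, i);
                } else {
                    tag.name = tagInfo.substring(1, tagInfo.length() - 1);
                }
            }
            this.index = end + 1;
        } else {
            // '<' with no '>' anywhere after it
            System.err.println("error:only start:" + src.substring(start));
            tag.type = Tag.Type.OnlyStart;
            this.index = start + 1;
        }
        return true;
    }

    public Tag next() {
        if (tag.name != null) {
            tag.name = tag.name.toLowerCase().trim();
        }
        return tag;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

}
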
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Stack;

import base.helper.TagIterator.Tag;

public class TidyHtml {

    private int badFormat = 0;

    public String handleUnMatcher(String src) {
        Stack<Tag> startTags = new Stack<Tag>();
        Stack<Tag> endTags = new Stack<Tag>();
        List<Tag> onlyStartTags = new ArrayList<Tag>();
        TagIterator it = new TagIterator(src);
        while (it.hasNext()) {
            Tag tag = it.next();
            // NOTE: the published listing is cut off here, from a second
            // condition (a tag.name.startsWith(...) test whose argument is
            // lost) up to the unMatchedTags declaration. The rest of this
            // loop is a reconstruction (an assumption), not the original text.
            if (tag.name.equals("br")) {
                continue;
            }
            if (tag.type == Tag.Type.Start) {
                startTags.push(tag);
            } else if (tag.type == Tag.Type.END) {
                endTags.push(tag);
            } else if (tag.type == Tag.Type.OnlyStart) {
                onlyStartTags.add(tag);
            }
        }
        List<Tag> unMatchedTags = new ArrayList<Tag>();
        if (!startTags.isEmpty() || !endTags.isEmpty()) {
            for (Tag t : endTags) {
                String endTagName = t.name;
                int size = startTags.size();
                boolean isMatched = false;
                // Pair each end tag with the nearest preceding start tag of the same name.
                for (int i = size - 1; i >= 0; i--) {
                    Tag startTag = startTags.get(i);
                    if (endTagName.equals(startTag.name) && startTag.start < t.start) {
                        isMatched = true;
                        startTags.remove(i);
                        break;
                    }
                }
                if (!isMatched) {
                    unMatchedTags.add(t);
                }
            }
            unMatchedTags.addAll(startTags);
        }
        // Process unmatched tags in document order so the offset arithmetic below works.
        Collections.sort(unMatchedTags, new Comparator<Tag>() {
            public int compare(Tag t1, Tag t2) {
                return t1.start - t2.start;
            }
        });
        Iterator<Tag> iterator = unMatchedTags.iterator();
        // Unmatched tags with these names are simply cut out of the source.
        String[] ignoreTags = { "p", "li", "ul", "pre", "a" };
        int alreadyRemovedOffset = 0;
        while (iterator.hasNext()) {
            Tag tag = iterator.next();
            for (String ignore : ignoreTags) {
                if (ignore.equals(tag.name)) {
                    src = src.substring(0, tag.start - alreadyRemovedOffset)
                            + src.substring(tag.end + 1 - alreadyRemovedOffset);
                    alreadyRemovedOffset += tag.all.length();
                    iterator.remove();
                    break;
                }
            }
        }

        if (unMatchedTags.size() > 1) {
            System.out.println(unMatchedTags);
            badFormat++;
        }
        // A '<' that is never closed: drop everything from that point on.
        if (!onlyStartTags.isEmpty()) {
            Tag t = onlyStartTags.get(0);
            src = src.substring(0, t.start - alreadyRemovedOffset);
        }
        return src;
    }

}
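
The matching idea can also be shown in a much smaller, self-contained form. The sketch below is not the program above: it is a simplified variant that finds tags with a regular expression and keeps a single stack of open tags, so it is stricter about nesting. The class name, the method name, and the regular expression are assumptions made for illustration, and comments are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified variant of the stack idea used in TidyHtml.
class UnmatchedTagSketch {

    // Matches "<name ...>", "</name>" and "<name ... />".
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*?(/?)>");

    // Returns the names of unmatched tags; end tags are prefixed with "/".
    static List<String> findUnmatched(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> unmatched = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean isEnd = !m.group(1).isEmpty();
            boolean selfClosed = !m.group(3).isEmpty();
            if (selfClosed || name.equals("br")) {
                continue; // nothing to match
            }
            if (!isEnd) {
                open.push(name);
            } else if (open.contains(name)) {
                // Pop up to the matching start tag; tags popped on the way were never closed.
                String top;
                while (!(top = open.pop()).equals(name)) {
                    unmatched.add(top);
                }
            } else {
                unmatched.add("/" + name);
            }
        }
        // Start tags still on the stack have no end tag.
        while (!open.isEmpty()) {
            unmatched.add(open.pop());
        }
        return unmatched;
    }
}
```

For example, findUnmatched("<div><p>text</div>") reports the unclosed p, and findUnmatched("<b>x</b></i><br>") reports the stray /i.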
 
linux
Some open source for OCR, Image recognition, handwriting recognition
Written by Joe   
Friday, 27 June 2008 01:15

Recently I wanted to do some work on image recognition, so I searched Google a lot and found several open source projects for image recognition, handwriting recognition, and OCR.

  1. tesseract-ocr

    The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A tiff reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.

    It is now hosted on Google Code, where you can download it.

  2. gocr

    GOCR is an OCR (Optical Character Recognition) program, developed under the GNU Public License. It converts scanned images of text back to text files. Joerg Schulenburg started the program, and now leads a team of developers. GOCR can be used with different front-ends, which makes it very easy to port to different OSes and architectures. It can open many different image formats, and its quality has been improving on a daily basis.

    You can get it from SourceForge.

  3. JOONE

    The Java Object Oriented Neural Network (JOONE) is an open source project that offers a highly adaptable neural network for Java programmers. The JOONE project source code is covered by a Lesser GNU Public License (LGPL). In a nutshell, this means that the source code is freely available and you need to pay no royalties to use JOONE. JOONE can be downloaded from sourceforge.net.

 
Joomla MultiAds Tutorial
Written by Joe   
Thursday, 26 June 2008 03:49

After installing the MultiAds plugin, open Extensions->Plugin Manager and you will find MultiAds in the list.

[Image: MultiAds in Joomla plugin manager]

Select MultiAds and click the edit button to open the MultiAds configuration page.

[Image: Joomla MultiAds configuration]

On the MultiAds configuration page, note the following:

  1. Make sure 'Enable' is set to 'Yes'.
  2. Copy your ad code into the four input boxes.
  3. Align style: Left, Right, Center, or None. The align style only affects 'Content top ads'.
  4. 'Content top ads' and 'Content bottom ads' only appear in the article view.

In the following picture, the image above the article is shown by 'Before Content Ads'.

[Image: Joomla MultiAds Before Content Ads]

In the following picture, you will find the 'Content top Ads', 'Content bottom Ads' and 'After Content Ads'.

[Image: Joomla MultiAds Result]

You can also put more than one ad in a box. In the picture above, I have put two Google ads in 'Content top Ads': one in 336*228 image style, the other in 336*228 text style. When you put two ads in one box, select the 'None' or 'Center' align style.

 
How to change the category list page number of Joomla 1.5
Written by Joe   
Wednesday, 25 June 2008 05:04

In Joomla 1.5, the default article category layout is 'Category List Layout', which lists all articles of the category on one page. If the category has a great number of articles, the page will respond very slowly.

We can solve this problem by giving the list a default number of items per page.

Open the file components/com_content/views/category/view.html.php, find line 68, and change it as follows:

      if ($layout == 'blog') {
        if($limit ==  0) $limit = $intro + $leading + $links;
      }
      else {
        if($limit ==  0) $limit = 20;
      }

After that change, each page of the 'Category List Layout' will display 20 items.
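
The rule the patch implements can be restated as a small worked example. This is a minimal sketch written in Java for illustration only; the class and method names are hypothetical and not part of Joomla, and the page-count helper is an added illustration of what the limit means for paging.

```java
// Hypothetical restatement of the limit rule from the PHP snippet above.
class CategoryLimitSketch {

    static final int DEFAULT_LIST_LIMIT = 20;

    // A limit of 0 means "not set": the blog layout falls back to
    // intro + leading + links, every other layout falls back to 20.
    static int effectiveLimit(String layout, int limit, int intro, int leading, int links) {
        if (limit != 0) {
            return limit;
        }
        return "blog".equals(layout) ? intro + leading + links : DEFAULT_LIST_LIMIT;
    }

    // Number of pages needed to show totalItems at the given page size.
    static int pageCount(int totalItems, int limit) {
        return (totalItems + limit - 1) / limit;
    }
}
```

With these assumptions, a list layout category of 45 articles and no explicit limit is split into 3 pages of at most 20 items, instead of one page with all 45.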

 
How to alter meta data generated by Joomla?
Written by Joe   
Wednesday, 25 June 2008 04:38

In Joomla 1.5, you will notice that the meta data generated for every page is the same. It looks like this:

<title></title>
<meta name="title" content="" />
<meta name="author" content=""/>
<meta name="description" content="" />
<meta name="keywords" content="" />
<meta name="generator" content=""/>
<meta name="robots" content="index, follow" />

However, this is not search engine friendly. Google's webmaster tools, for example, will report it as duplicate meta data. With the Joomla SEF Patch, you can easily alter the meta data.

 