HTML fixer

THIS PROJECT IS NO LONGER MAINTAINED. YOU CAN USE THE SOURCE CODE IN YOUR PROJECT AS YOU WANT.

LAST UPDATE: 06/07/2010

STOP bad HTML inserted by your clients or by the users of your community!

This PHP class lets you clean and repair HTML code. Here is a quick list of the things it can do (it is especially useful when you cannot install PHP's HTML Tidy extension).

WHAT IT DOES:

  1. deletes closing tags that have no opening tag
  2. fixes opening tags that were never closed, closing them automatically
  3. checks for bad nesting and fixes it (a bold inside a bold, a paragraph that contains a table…)
  4. fixes bad quotes in attributes (adds quotes where they are missing…)
  5. merges multiple style attributes in the same tag
  6. removes HTML comments
  7. removes empty tags and other bad tags
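
To make items 1 and 2 concrete, here is a minimal sketch of one common way to do it, using a stack of open tag names. This is not the code of the class: balanceTags() and its short list of void tags are assumptions made up for this illustration.

```php
<?php
// Illustrative sketch only, not the HtmlFixer source.
// balanceTags() is a hypothetical helper covering items 1 and 2 of the list.
function balanceTags(string $html): string
{
    // Tags that never take a closing tag (short, assumed list).
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $stack = [];   // names of the tags that are currently open
    $out = '';

    // Split into tags and text, keeping the tags as separate pieces.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($parts as $part) {
        if (preg_match('/^<\/\s*([a-z][a-z0-9]*)\s*>$/i', $part, $m)) {
            $name = strtolower($m[1]);
            if (!in_array($name, $stack, true)) {
                continue;   // closing tag with no opening tag: drop it
            }
            // Close everything opened after the matching tag, then the tag itself.
            while (($top = array_pop($stack)) !== $name) {
                $out .= "</$top>";
            }
            $out .= "</$name>";
        } elseif (preg_match('/^<([a-z][a-z0-9]*)\b[^>]*>$/i', $part, $m)) {
            $name = strtolower($m[1]);
            $selfClosed = substr($part, -2) === '/>';
            if (!$selfClosed && !in_array($name, $void, true)) {
                $stack[] = $name;   // remember that this tag still needs closing
            }
            $out .= $part;
        } else {
            $out .= $part;          // plain text, comments, anything else
        }
    }

    // Close whatever is still open, innermost first.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo balanceTags('<p>Hello <b>world</i>'), "\n";
// prints <p>Hello <b>world</b></p>
```

The stray closing i is dropped because no i is open, and the b and p that were left open are closed in reverse order at the end.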

How does it work?
It's a bit complex to explain. It analyzes the HTML code character by character, detecting nodes, looking inside each node to fix quotes, attributes, and more, and finding their closing tags. It saves every node found, and its inner content, in a matrix.
Then it reads the matrix to rebuild the fixed HTML.
The matrix stores opening tags, closing tags, and content, and makes it possible to count the errors.
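
The general idea can be sketched as follows. This is a simplified illustration, not the class itself: scanNodes(), rebuildHtml(), and the row layout are hypothetical, and the sketch ignores details such as a ">" inside a quoted attribute value.

```php
<?php
// Illustrative sketch only; the real class stores more detail per node.
// scanNodes() and rebuildHtml() are hypothetical names.
function scanNodes(string $html): array
{
    $matrix = [];     // one row per node: type and raw value
    $buffer = '';
    $inTag = false;
    $len = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if (!$inTag && $ch === '<') {
            // A tag starts: store the text collected so far.
            if ($buffer !== '') {
                $matrix[] = ['type' => 'text', 'value' => $buffer];
            }
            $buffer = '<';
            $inTag = true;
        } elseif ($inTag && $ch === '>') {
            // A tag ends: classify it as opening or closing.
            $buffer .= '>';
            $type = ($buffer[1] === '/') ? 'close' : 'open';
            $matrix[] = ['type' => $type, 'value' => $buffer];
            $buffer = '';
            $inTag = false;
        } else {
            $buffer .= $ch;
        }
    }
    if ($buffer !== '') {
        // Leftover text, or a tag that was never terminated with ">".
        $matrix[] = ['type' => $inTag ? 'broken' : 'text', 'value' => $buffer];
    }
    return $matrix;
}

function rebuildHtml(array $matrix): string
{
    $out = '';
    foreach ($matrix as $row) {
        if ($row['type'] !== 'broken') {   // drop unterminated tags
            $out .= $row['value'];
        }
    }
    return $out;
}

$rows = scanNodes('<p>Hi</p><span style="color: rgb(1"');
echo count($rows), "\n";        // prints 4
echo rebuildHtml($rows), "\n";  // prints <p>Hi</p>
```

Because the loop advances exactly one character per step, it always terminates, even on a truncated tag like the one in the example.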

Watch a demo of the HTML FIXER CLASS with debug.
Watch a demo of the HTML FIXER CLASS with a textarea to insert dirty html code.
Download the class and the example.

HISTORY
NEW. version 2.05 date 06/07/2010
bug fixed on quotes by emmanuel (at) evobilis.com

version 2.04
added css style filter by Martin Vool

version 2.03
strips php code.

version 2.02
fixed a bug with non closing quotes.

Comments on “HTML fixer”

50 thoughts

  1. Pol moneys said:

    interesting, thanks for sharing.

  2. Pere said:

    Hi,

    I'm having the following problem with your class:

    Imagine that my string is:
    $string = "Etiam non massa sit amet lacus ultrices pulvinar id ultrices tellus. Proin ornare ante vel nibh adipiscing.";

    When I do $string = substr($string, 0, 120), my $string contains:
    "Etiam non massa sit amet lacus ultrices pulvinar id ultrices tellus. Proin ornare ante
    <span style="color: rgb(1"

    When I try to fix this with your class, it goes into an infinite loop. Any suggestions?

  3. admin said:

    Thank you very much. The problem was caused by the unclosed quote. I've fixed it, and the class now works with your example. Download the new version (2.02) at the same address.

  4. alessio said:

    In your online demo, if I simply type: ?>
    it gives the following warning, which would be better to hide ;)
    Warning: preg_match() [function.preg-match]: Compilation failed: nothing to repeat at offset 0 in /web/htdocs/www.barattalo.it/home/examples/htmlfixer.class.php on line 363

    Greetings from Chicago

  5. alessio said:

    Sorry, I meant <?

  6. admin said:

    Thanks, fixed. Now it also strips PHP code.

  7. Vinodkumar said:

    This class does not add an opening tag for closing tags that lack one! Please let me know how we can do this.

  8. Martin Vool said:

    Hello, I added a little feature to this. It lets you set the allowed style attributes (I don't want users to mess up the website's font face, encoding, etc. by pasting from Word or similar).
    To add this feature I modified the code as follows.
    At the end of the mergeStyleAttributes function:
    $check = explode(';', $x);
    $x = '';
    foreach ($check as $chk) {
        foreach ($this->allowed_styles as $as) {
            if (stripos($chk, $as) !== false) {
                $x .= $chk . ';';
                break;
            }
        }
    }
    if ($c > 0) $s = str_replace("##PUTITHERE##", "style=\"" . $x . "\"", $s);
    return $s;

    And I added "public $allowed_styles;" to the class.
    To use it:
    $a = new HtmlFixer();
    $a->allowed_styles = array('font-weight','width','height','align','valign',…);
    $str = $a->getFixedHtml(stripslashes($str));

    It can easily be rewritten to block only the listed options instead…
    just put a ! in front of "stripos"

    Maybe you should also write something similar to strip_tags (to allow or deny certain HTML tags)? I could help you.

  9. Darren said:

    Hi,

    Great and useful script. I was wondering how to allow PHP to be included within the dirty HTML (but ignored, so that the PHP is not affected while the HTML is fixed, and the PHP remains in the output). I'm a beginner PHP user and have a file containing both dirty HTML and PHP; extracting the PHP and re-adding it is a tedious task!

    Looking forward to your reply.

    Thanks

  10. admin said:

    Thank you Martin Vool, I’ve added your code to the class!

  11. Darren said:

    Hi, admin, can you check:

    https://www.barattalo.it/html-fixer/comment-page-1/#comment-1439

    Any ideas? :D

    Thanks

  12. Emmanuel said:

    Hi,

    It seems your class doesn't work for HTML code with a missing quote, as follows:

    I checked your code and found 2 possible issues:
    1- in function "fixQuotes": you forgot to initialize the variable $q to "\""

    private function fixQuotes($s) {
    $q = "\"";

    2- in function "fixTag": your code skips the quote fix for any attribute that already has an opening quote
    "if (stristr($ar[$i],"=") && !stristr($ar[$i],"=\"")) $ar[$i] = $this->fixQuotes($ar[$i]);"
    It should fix quotes in every case:

    if (stristr($ar[$i],"=")) $ar[$i] = $this->fixQuotes($ar[$i]);

    I've tested it and it seems to work correctly. I'll let you check and amend your class.

    Cheers & thanks again for your class.

  13. Dave said:

    Thanks for sharing this tool. Great work.

  14. Kim Steinhaug said:

    Great class, it really works very well. However, I would question some of your logic.

    Example:
    here we go!One more line

    Your class will kill the em, as your logic denies an em outside a P, resulting in this code:

    here we go!One more line

    However, given that the code might come from e.g. TinyMCE or another WYSIWYG editor, my example is a valid one and should produce the following code, if fixed at all:

    here we go!One more line

    Killing the em doesn't really fix anything; it actually destroys the presentation of the page.

    It is easy to prevent in your code, however, by just modifying the array for the p check. As I have not had time to let your class sink into my brain yet, I haven't come up with a fix of my own; I just disabled the check. If I do, I will surely post it here!

  15. Kim Steinhaug said:

    That didn’t work, original post with the correct HTML here:
    http://codepad.org/PGSn8vuG

  16. admin said:

    I understand your problem, but I think it is not easy to fix. If you modify the class, tell me and send me the new code; I will update the class and add your name to its developers.

  17. admin said:

    Thanks, I’ve modified the code.

  18. admin said:

    Hi Darren, I think it is necessary to detect the PHP start and end tags and handle that code in a different way. Not so simple, I think. :) Let me know if you want to extend this class with that feature!

  19. tj said:

    Thanks for sharing this class. Great work.

  20. Martin said:

    Very good, but I found a problem with ordered and unordered lists. Example: with bad code that has only opening tags, "sometingother", it does not work well: I get the closing tags at the end of the code instead of at the end of every list item. See "sometingother"

  21. Martin said:

    The form deleted the HTML tags I inserted…

  22. mardi siswo utomo said:

    Sorry if I'm wrong, but there is a wrong quote at line number 100; I have fixed it, though.
    Mine is working great now. If anyone wants the working code, you can download it from

    http://blog-walk.com/download <-- delete if you don't like it and mail me, I'll remove it from my site

    Regards

  23. Giulio Pons said:

    I’ve also fixed it here. Bye!

  24. Giulio Pons said:

    Please, make the link to my site clickable, I’ve also fixed the quotes on line 100. Thank you.

  25. Bogdan Bratila said:

    Hello,
    If you have a string like this "Buna ziua,
    Multa sanatate si bucurii!!!"
    <p align="right"

    The resulting code breaks even more:
    Fixed code:"Buna ziua, Multa sanatate si bucurii!!!"

    So, if you have a tag that doesn't end (like that p tag, or a span…), I don't know why, but it breaks the code before that tag…

  26. Bogdan Bratila said:

    So it seems I was wrong; it has nothing to do with an unclosed tag.
    If the tag has styles attached, the class breaks it.
    <span style="font-size: 11pt; font-family: Arial; color: black;">asdasd</span>
    The result is:
    <span style="font-size:;" 11pt; font-family: Arial; color: black;">asdasd</span>
    :D

  27. Carl said:

    Got an error with this.

    Should fixedhtml be fixedxhtml?

    I was getting duplicated content when calling the getFixedHtml() method of the same object more than once.

    I think it just wasn't resetting the fixedxhtml variable at the start of each call because of the naming.

  28. Yahasana said:

    Thanks for sharing!

    Please declare all private variables and methods as protected, so that the class can be extended.

  29. Giulio Pons said:

    YES! thank you.

  30. tommy said:

    nice job… thx for sharing

  31. Ale. said:

    Hi Giulio,
    I couldn't find any reference to the license for using this class. Could you clarify?

    Thanks!

  32. Giulio Pons said:

    Mmm, I don't know the various licenses very well, so I haven't specified one. If you need it, go ahead and use this class! Bye!

  33. I noted something wrong with your class:

    becomes

    Any fix for this?

  34. style="text-align: right;"

    becomes

    style="text-align:;" right;"

  35. Another problem I found was this:

    class="abc def"

    becomes

    class="abc" def

  36. diogo said:

    Excellent tool, almost perfect, very useful.
    But when using in this case, I request help:
    Original

    Result from Htmlfix

  37. diogo said:

    Excellent tool, almost perfect, very useful.
    But when using in this case, I request help:
    Original
    <img src="char.jpg" style="width: 378px; height: 378px;" alt="" />
    Result from Htmlfix
    <img src="char.jpg" style="width:;" 378px; height: 378px;" alt="" />

  38. Rafael said:

    Hi – I’m using your class for a little project of mine (see link above) and it works excellently except for one problem. The script is not respecting whitespace in tags. This is necessary according to the HTML specification. At the moment your script replaces all instances of whitespace with a single space. This means newlines, etc. are all “cleaned”. The “cleaning” is *too* efficient! :)

  39. sexytrends said:

    Your class does a good job with HTML tags; however, styles get all messed up, as diogo wrote above. I'm guessing it's your merge function that needs work.

  40. sexytrends said:

    I just did a quick test, and it is your fixQuotes function that breaks style attributes; ignore my previous comment.

  41. Snowbird said:

    Great script. Time saver.
    However, I see that the following is converted wrongly… See the style definition

    TEST

    Fixed code:
    TEST

  42. Fredj A J said:

    HTML FIXER CLASS OUTPUT
    style="position:relative;" font-size:50px; z-index:2;">LAYER 1

    ORIGINAL CODE
    style="position:relative; font-size:50px; z-index:2;">LAYER 1

    I hope you can help me remove the extra double quote!
    Any help would be appreciated.

  43. Iesha Redshaw said:

    Thank you! My spouse and I liked this posting. I am curious as to what you think of Evergreen?

  44. Nemi said:

    It does not work. Try:
    <div style="width: 151px; display: none;" class="b
    You get:
    <div style="width:;" 151px; display: none;" class="b

  45. Гость said:

    My Apache crashes.

  46. Simon said:

    Simple and easy! Love it. Now able to auto parse imported web pages with WAY LESS effort!! – Thanks!!!

  47. Miljan said:

    This is very useful, thank you very much :)

  48. Torben said:

    Hey, great library. Can you please allow frameborder="0"? This is valid XHTML 1.0, but the string gets half replaced.

    --- htmlfixer.class.php (Revision xx)
    +++ htmlfixer.class.php (Arbeitskopie)
    @@ -86,7 +86,7 @@
    $t = preg_replace (
    array(
    '/borderColor=([^ >])*/i',
    - '/border=([^ >])*/i'
    + '/ border=([^ >])*/i'
    ),

  49. Alaa said:

    Great job.
    I have my own HTML fixer, but it is not very good, so I want to rebuild your code as a function instead of a class and join the two together.
    That would be legal, right?

  50. Giulio Pons said:

    Yes, you can do whatever you want :-)
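
Several comments above (26, 34, 35, 37, 42, 44) report the same symptom: a quoted attribute value that contains spaces is cut at the first space and a stray quote is added. The class source is not reproduced on this page, so the following is only a sketch of one possible way to avoid that, reading the attributes with a quote-aware regular expression instead of splitting the tag on spaces. parseAttributes() and buildTag() are hypothetical names, not methods of the class.

```php
<?php
// Hypothetical helpers, not part of HtmlFixer.
// parseAttributes() reads name="value", name='value' and name=value pairs
// from a tag, keeping any spaces that sit inside the quotes.
function parseAttributes(string $tag): array
{
    $attrs = [];
    $re = '/([a-z][a-z0-9_:-]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    preg_match_all($re, $tag, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);
    foreach ($matches as $m) {
        // Only one of the three value groups matches; the others are null.
        $attrs[strtolower($m[1])] = $m[2] ?? $m[3] ?? $m[4] ?? '';
    }
    return $attrs;
}

// buildTag() writes the attributes back, always with double quotes.
function buildTag(string $name, array $attrs): string
{
    $out = '<' . $name;
    foreach ($attrs as $key => $value) {
        $out .= ' ' . $key . '="' . str_replace('"', '&quot;', $value) . '"';
    }
    return $out . '>';
}

echo buildTag('span', parseAttributes('<span style="font-size: 11pt; color: black;" class=abc>')), "\n";
// prints <span style="font-size: 11pt; color: black;" class="abc">
```

An attribute whose closing quote is missing, as in comment 44, does not match the pattern and is simply left out of the result; a real fixer would have to decide how to recover it.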

Comments are closed