Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

Question 1

I have various HTML strings that need to be cut to 100 characters (of the stripped content, not the original) without stripping tags and without breaking the HTML.

Original HTML string (288 characters):

$content = '<div>With a <span class="spanClass">span over here</span> and a
<div class="divClass">nested div over <div class="nestedDivClass">there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it\'s a HTML taggy kind of day.</strong></div>';

Straightforward approach: trimming to 100 characters breaks the HTML, and the stripped content comes to only ~40 characters:

$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class="spanClass">span over here</span> and a
<div class="divClass">nested div ove... */

Stripped HTML: outputs the correct character count, but obviously loses the formatting:

$content = substr(strip_tags($content), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */

Partial solution: using HTML Tidy or Purifier to close off the tags outputs clean HTML, but does not give 100 characters of displayed text:

$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class="spanClass">span over here</span> and a
<div class="divClass">nested div ove</div></div>... */

Challenge: output clean HTML and n characters (not counting the characters of the HTML elements themselves):

$content = cutHTML($content, 100); /* output:
<div>With a <span class="spanClass">span over here</span> and a
<div class="divClass">nested div over <div class="nestedDivClass">there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>... */
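The two requirements in the challenge (a given number of visible characters, and no broken markup) can be checked mechanically. The following is a rough sketch; `visible_length()` and `tags_balanced()` are hypothetical helper names that do not appear in the question, and the balance check is deliberately simple (it ignores comments and `>` inside attribute values).

```php
<?php
// Hypothetical helpers for checking a truncation result; not part of the original post.

/** Number of characters a reader would see, i.e. with all tags stripped. */
function visible_length(string $html): int
{
    return mb_strlen(strip_tags($html), 'UTF-8');
}

/** Rough check that every opened tag is closed again in the right order. */
function tags_balanced(string $html): bool
{
    $void  = array('br', 'hr', 'img', 'input', 'link', 'meta'); // never closed
    $stack = array();
    preg_match_all('/<(\/?)([a-z][a-z0-9]*)[^>]*>/i', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if (in_array($name, $void, true)) {
            continue;
        }
        if ($m[1] === '') {
            $stack[] = $name;                 // opening tag
        } elseif (array_pop($stack) !== $name) {
            return false;                     // closing tag without a matching opener
        }
    }
    return count($stack) === 0;               // leftovers mean unclosed tags
}

// A plain substr() cut lands inside the <span ...> tag and leaves <div> open.
$broken = substr('<div>With a <span class="spanClass">span</span></div>', 0, 30);
var_dump(tags_balanced($broken)); // bool(false)
```

Any candidate function can then be judged by whether its output passes `tags_balanced()` while `visible_length()` stays at or below the requested limit (plus the length of the ending, if one is appended).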

Similar questions

How to clip HTML fragments without breaking up tags

Cutting HTML strings without breaking HTML tags

Question 2

Not amazing, but it works.

function html_cut($text, $max_length)
{
    $tags   = array();
    $result = "";

    $is_open   = false;
    $grab_open = false;
    $is_close  = false;
    $in_double_quotes = false;
    $in_single_quotes = false;
    $tag = "";

    $i = 0;
    $stripped = 0;

    $stripped_text = strip_tags($text);

    while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
    {
        $symbol  = $text[$i];
        $result .= $symbol;

        switch ($symbol)
        {
           case '<':
                $is_open   = true;
                $grab_open = true;
                break;

           case '"':
               if ($in_double_quotes)
                   $in_double_quotes = false;
               else
                   $in_double_quotes = true;

            break;

            case "'":
              if ($in_single_quotes)
                  $in_single_quotes = false;
              else
                  $in_single_quotes = true;

            break;

            case '/':
                if ($is_open && !$in_double_quotes && !$in_single_quotes)
                {
                    $is_close  = true;
                    $is_open   = false;
                    $grab_open = false;
                }

                break;

            case ' ':
                if ($is_open)
                    $grab_open = false;
                else
                    $stripped++;

                break;

            case '>':
                if ($is_open)
                {
                    $is_open   = false;
                    $grab_open = false;
                    array_push($tags, $tag);
                    $tag = "";
                }
                else if ($is_close)
                {
                    $is_close = false;
                    array_pop($tags);
                    $tag = "";
                }

                break;

            default:
                if ($grab_open || $is_close)
                    $tag .= $symbol;

                if (!$is_open && !$is_close)
                    $stripped++;
        }

        $i++;
    }

    while ($tags)
        $result .= "</".array_pop($tags).">";

    return $result;
}

Usage example:

$content = html_cut($content, 100);

Question 3

I don't claim to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:

function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if (is_array($ending)) {
        extract($ending);
    }
    if ($considerHtml) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen($ending);
        $openTags = array();
        $truncate="";
        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }

    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if (isset($spacepos)) {
            if ($considerHtml) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }

    $truncate .= $ending;

    if ($considerHtml) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

Question 4

Use PHP's DOMDocument class to normalize an HTML fragment:

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');      
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

This question is similar to an earlier question, and I have copied and pasted one solution here. If the HTML is submitted by users, you will also need to filter out potential JavaScript attack vectors such as onmouseover="do_something_evil()" or <a href="javascript:more_evil();">...</a>. Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.

Question 5

Use an HTML parser and stop after 100 characters of text.
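One possible reading of that suggestion, sketched with PHP's bundled DOMDocument: parse the fragment, walk the text nodes in document order while counting characters, cut the text node in which the limit is reached, and drop everything after it. The function names `truncate_node()` and `truncate_html_dom()` are made up for this sketch and are not from any library; the `<?xml encoding="utf-8" ?>` prefix is the usual workaround to make loadHTML() treat the input as UTF-8.

```php
<?php
// Hypothetical helper names; a sketch, not a hardened implementation.

/** Walk $node in document order, keeping at most $remaining characters of text. */
function truncate_node(DOMNode $node, int &$remaining): void
{
    // Snapshot the child list, because children are removed during the walk.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);          // everything past the limit goes
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8');
            }
            $remaining -= min($length, $remaining);
        } elseif ($child instanceof DOMElement) {
            truncate_node($child, $remaining);   // element: descend into it
        }
    }
}

function truncate_html_dom(string $html, int $limit = 100, string $ending = '...'): string
{
    $dom = new DOMDocument();
    // The XML declaration hints UTF-8 to libxml; @ hides warnings about sloppy markup.
    @$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null || mb_strlen($body->textContent, 'UTF-8') <= $limit) {
        return $html;                            // nothing to cut
    }
    truncate_node($body, $limit);
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $dom->saveHTML($child);       // serialise without the <body> wrapper
    }
    return $result . $ending;
}

echo truncate_html_dom('<div>Hello <b>big <i>wide</i> world</b>!</div>', 12), "\n";
```

Because the parser builds a complete tree, the closing tags come for free. The trade-offs are that the parser may normalise the markup (for example, wrapping bare top-level text in a paragraph) and that the serializer may add line breaks between block elements.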

Question 6

I wrote another function for this that supports UTF-8:

/**
 * Limit string without break html tags.
 * Supports UTF8
 * 
 * @param string $value
 * @param int $limit Default 100
 */
function str_limit_html($value, $limit = 100)
{

    if (mb_strwidth($value, 'UTF-8') <= $limit) {
        return $value;
    }

    // Strip text with HTML tags, sum html len tags too.
    // Is there another way to do it?
    do {
        $len          = mb_strwidth($value, 'UTF-8');
        $len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
        $len_tags     = $len - $len_stripped;

        $value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
    } while ($len_stripped > $limit);

    // Load as HTML ignoring errors
    $dom = new DOMDocument();
    @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);

    // Fix the html errors
    $value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));

    // Remove body tag
    $value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
    // Remove empty tags
    return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}

See demo.

I recommend calling html_entity_decode at the start of the function so that the UTF-8 characters are preserved:

 $value = html_entity_decode($value);
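The effect of that step can be seen in a small example: an entity such as `&uuml;` occupies six characters in the source but is a single visible character, so counting before decoding overstates the length. The sketch below spells out the flags and the charset explicitly, which is an assumption on my part; the one-argument call above relies on PHP's defaults, which differ between PHP versions.

```php
<?php
// Sample input chosen for illustration only.
$value = '<p>Gr&uuml;&szlig;e &amp; more</p>';

// Explicit flags and charset, so the result does not depend on the PHP version's defaults.
$decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Visible length with entities still encoded versus decoded.
$before = mb_strlen(strip_tags($value), 'UTF-8');
$after  = mb_strlen(strip_tags($decoded), 'UTF-8');

echo $decoded, "\n";
echo $before, ' -> ', $after, "\n";
```

Note that decoding also turns `&lt;` and `&gt;` into literal angle brackets, so this step only fits input where that is acceptable.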

Question 7

You should use Tidy HTML. You cut the string and then run Tidy to close the tags.

(Credit where credit is due)

Question 8

Regardless of the 100-character count issues you describe at the beginning, you state the following in the challenge:

output the character count of strip_tags (the number of characters in the actual displayed text of the HTML)
retain the HTML formatting
close any unfinished HTML tag

Here is my proposal: basically, I parse through each character, counting as I go. I make sure NOT to count any characters inside an HTML tag. I also check at the end that I am not in the middle of a word when I stop. Once I stop, I back up to the first available SPACE or > as a stopping point.

$position = 0;
$length = strlen($content)-1;

// process the content putting each 100 character section into an array
while($position < $length)
{
    $next_position = get_position($content, $position, 100);
    $data[] = substr($content, $position, $next_position - $position); // third argument is a length, not an end offset
    $position = $next_position;
}

// show the array
print_r($data);

function get_position($content, $position, $chars = 100)
{
    $count = 0;
    // count to 100 characters skipping over all of the HTML
    while($count <> $chars){
        $char = substr($content, $position, 1); 
        if($char == '<'){
            do{
                $position++;
                $char = substr($content, $position, 1);
            } while($char !== '>');
            $position++;
            $char = substr($content, $position, 1);
        }
        $count++;
        $position++;
    }
    echo $count."\n";
    // find out where there is a logical break before 100 characters
    $data = substr($content, 0, $position);

    $space = strrpos($data, " ");
    $tag = strrpos($data, ">");

    // return the position of the logical break
    if($space > $tag)
    {
        return $space;
    } else {
        return $tag;
    }  
}

This also counts line-break characters and the like. Since they take up space, I have not removed them.