[GolemBridge] Add multi-page headings

On multi-page articles like [1], some paragraph headers were missing
because they are the headers of the individual article pages.

These headers were previously removed in c5f586497f for being redundant
with the original header. The article at [1] proves that assumption
wrong, so I added logic to skip only those headers that truly duplicate
the article header.

[1] https://www.golem.de/news/es-muss-nicht-immer-apple-sein-fuenf-ueberzeugende-airpods-pro-alternativen-im-test-2508-195000.html
Mynacol 2025-08-17 11:57:00 +00:00
parent 876d3c8ae7
commit e30698f12f


@@ -132,13 +132,22 @@ class GolemBridge extends FeedExpander
// delete known bad elements
foreach (
    $article->find('div[id*="adtile"], #job-market, #seminars, iframe,
    .gbox_affiliate, div.toc') as $bad
) {
    $bad->remove();
}
// reload html, as remove() is buggy
$article = str_get_html($article->outertext);
// Add multipage headers, but only if they are different to the article header
$firstHeader = $page->find('.table-jtoc td', 0);
if (isset($firstHeader)) {
    $firstHeader = html_entity_decode($firstHeader->title);
}
$multipageHeader = $article->find('header.paged-cluster-header h1', 0);
if (isset($multipageHeader) && $multipageHeader->plaintext !== $firstHeader) {
    $item .= $multipageHeader;
}
$header = $article->find('header', 0);
foreach ($header->find('p, figure') as $element) {