Skip to content

Replace scala-xml internal API usage with public APIs#2050

Merged
farmdawgnation merged 6 commits intomainfrom
msf/replace-scala-xml-parser
Mar 15, 2026
Merged

Replace scala-xml internal API usage with public APIs#2050
farmdawgnation merged 6 commits intomainfrom
msf/replace-scala-xml-parser

Conversation

@farmdawgnation
Copy link
Copy Markdown
Member

Summary

Resolves GitHub issue #2032: removes all internal scala-xml dependency from lift-webkit's document parsing pipeline.

  • Html5Parser: Replaced NoBindingFactoryAdapter with public XML.withXMLReader() API
  • PCDataXmlParser: Replaced MarkupParser/ConstructingHandler with SAX-based LiftSaxTreeBuilder
  • Removed all scala.xml.parsing.* and scala.xml.dtd.* imports
  • Added comprehensive characterization tests (80+ new tests) to lock in parsing behavior

No breaking changes to public APIs. All 501 util tests pass.

Testing

  • 501 tests passing, 0 failures, 3 pre-existing pending
  • HTML5 parsing: entity handling, namespace prefixes, CDATA preservation
  • XHTML parsing: named entity preprocessing, lift: prefix support, round-trip fidelity

farmdawgnation and others added 5 commits February 25, 2026 22:06
Adds comprehensive characterization tests for Html5Parser, Html5Writer,
PCDataXmlParser, AltXML, and MarkdownParser to lock in current behavior
before refactoring the scala-xml internal API dependencies.

New test coverage includes: AutoInsertedBody unwrapping, void/non-void tag
rendering, CDATA round-trips as Lift PCData nodes, HTML entity resolution,
namespace prefix handling, script/style verbatim content, unicode, malformed
input failure cases, and full parse+serialize round-trips for both parser paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tml5Parser

Use the public scala-xml API (XML.withXMLReader) instead of the unsupported
NoBindingFactoryAdapter internal class. The nu.validator HtmlParser already
implements XMLReader, so it integrates directly with XMLLoader.load().

Removes all scala.xml.parsing._ imports and internal field access from
HtmlParser.scala. AutoInsertedBody stripping logic is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mplementation

Replace the PCDataMarkupParser trait and PCDataXmlParser class (which extended
ConstructingHandler, MarkupParser, and ExternalSources from scala.xml.parsing)
with a SAX-based LiftSaxTreeBuilder that uses javax.xml.parsers.SAXParserFactory.

Key design decisions:
- namespace-aware=false to allow undeclared namespace prefixes (e.g. lift:)
  and surface xmlns:* as regular attributes for manual NamespaceBinding construction
- LexicalHandler integration detects CDATA sections and produces Lift PCData nodes
- Named HTML entities (e.g. &nbsp;) preprocessed to numeric char refs before
  SAX parsing since a standard XML parser has no DTD with entity declarations
- CDATA sections skipped during entity preprocessing to preserve their content
- Public API (PCDataXmlParser.apply(String/InputStream) -> Box[NodeSeq]) unchanged
- Preamble stripping logic (XML declaration + DOCTYPE) unchanged

Removes all scala.xml.parsing.* imports from this file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused import scala.xml.parsing.XhtmlParser from BlockParsers.scala
- Remove HtmlEntities.entities and apply() which used ParsedEntityDecl/IntDef
  from scala.xml.dtd (these were only used by the now-removed PCDataXmlParser class)
- Remove import scala.xml.dtd._ from PCDataMarkupParser.scala

Completes the removal of all scala.xml internal API usage from lift-util.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ures

In specs2 5.x (used for Scala 3), XmlMatchers moved from specs2-matcher-extra
to a separate specs2-xml artifact. Eight test files in scala-3/ imported
org.specs2.matcher.XmlMatchers, causing compilation failures for all Scala 3.3.7
CI jobs.

Also fix a type inference issue in HtmlRoundTripSpec where `must not contain`
was ambiguous under Scala 3 + specs2 5.x; replaced with `must not(contain(...))`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Lift’s XML/HTML parsing and serialization test coverage, and reworks PCDataXmlParser to use a SAX-based implementation (with added build support for Specs2 XML matchers on Scala 3).

Changes:

  • Replaced the PCDataXmlParser implementation to parse via SAX and added preprocessing for named HTML entities and CDATA handling.
  • Updated Html5Parser to parse via scala.xml.XML.withXMLReader and expanded characterization/round-trip test coverage for HTML/Markdown/XML utilities.
  • Added conditional Specs2 XML dependency for Scala 3 test builds.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
project/Dependencies.scala Adds specs2XmlDeps helper to include Specs2 XML module on Scala 3.
build.sbt Wires Scala-version-dependent Specs2 XML test dependency into the build.
core/util/src/main/scala/net/liftweb/util/PCDataMarkupParser.scala Replaces MarkupParser-based PCData XML parsing with a SAX-based parser and entity preprocessing.
core/util/src/main/scala/net/liftweb/util/HtmlParser.scala Switches HTML5 parsing to XML.withXMLReader(hp).load and retains AutoInsertedBody unwrapping.
core/markdown/src/main/scala/net/liftweb/markdown/BlockParsers.scala Removes unused XhtmlParser import.
core/util/src/test/scala-3/net/liftweb/util/XmlParserSpec.scala Adds AltXML behavior tests (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/PCDataXmlParserSpec.scala Expands PCDataXmlParser characterization tests (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/MarkdownParserSpec.scala Adds characterization tests for MarkdownParser (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/HtmlRoundTripSpec.scala Adds round-trip tests for Html5 and PCDataXmlParser+AltXML (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/Html5ParserSpec.scala Expands Html5Parser/Writer test coverage (Scala 3).
core/util/src/test/scala-2.13/net/liftweb/util/XmlParserSpec.scala Adds AltXML behavior tests (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/PCDataXmlParserSpec.scala Expands PCDataXmlParser characterization tests (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/MarkdownParserSpec.scala Adds characterization tests for MarkdownParser (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/HtmlRoundTripSpec.scala Adds round-trip tests for Html5 and PCDataXmlParser+AltXML (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/Html5ParserSpec.scala Expands Html5Parser/Writer test coverage (Scala 2.13).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +79 to +99
def apply(in: String): Box[NodeSeq] = {
var pos = 0
val len = in.length
def moveToLT(): Unit =
while (pos < len && in.charAt(pos) != '<') pos += 1

if ((ch != '/') && (ch != '>') && ('?' != ch))
xSpace();
}
moveToLT()

def findIt(base: MetaData, what: MetaData): MetaData = (base, what) match {
case (_, Null) => Null
case (upb: UnprefixedAttribute, upn: UnprefixedAttribute) if upb.key == upn.key => upn
case (pb: PrefixedAttribute, pn: PrefixedAttribute) if pb.key == pn.key && pb.pre == pn.key => pn
case _ => findIt(base, what.next)
// scan past <? ... ?>
if (pos + 1 < len && in.charAt(pos + 1) == '?') {
pos += 1
moveToLT()
}

if(!aMap.wellformed(scope)) {
if (findIt(aMap, aMap.next) != Null) {
reportSyntaxError( "double attribute")
}
// scan past <!DOCTYPE ....>
if (pos + 1 < len && in.charAt(pos + 1) == '!') {
pos += 1
moveToLT()
}

(aMap,scope)
parseXml(preprocessEntities(in.substring(pos)))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes from Opus:

The logic does naively match any <! sequence and skip forward to the next <. However, Copilot overstates the practical risk:

Comments before the root element in well-formed XML are unusual in Lift’s use case (HTML templates).
The old MarkupParser-based code didn’t handle this either — it had a similar heuristic approach.
That said, the suggestion to explicitly match and >) is reasonable and low-effort. It would make the code more robust without much complexity.

We're already changing a lot here. I think this is more reasonable as a future improvement.

Comment on lines +102 to +116
private def parseXml(content: String): Box[NodeSeq] =
tryo {
// namespace-aware=false: undeclared prefixes (e.g. lift:) are allowed; xmlns
// attributes are surfaced as regular attributes so we can build NamespaceBinding
// manually, matching the behaviour of the previous MarkupParser-based impl.
val spf = SAXParserFactory.newInstance()
spf.setNamespaceAware(false)
val saxParser = spf.newSAXParser()
val xmlReader = saxParser.getXMLReader
val builder = new LiftSaxTreeBuilder
xmlReader.setContentHandler(builder)
xmlReader.setErrorHandler(builder)
xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler", builder)
xmlReader.parse(new InputSource(new StringReader(content)))
NodeSeq.fromSeq(Seq(builder.result))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what situation would result in untrusted entities running through this code path so I'm inclined to leave this alone for the moment.

Wrap parsing in try/finally so the InputStream is always closed,
including when XML.withXMLReader(...).load() throws an exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@farmdawgnation
Copy link
Copy Markdown
Member Author

I've spun up seventhings with this branch and it seems to be working. Shipping this.

@farmdawgnation farmdawgnation merged commit a286926 into main Mar 15, 2026
6 checks passed
@farmdawgnation farmdawgnation deleted the msf/replace-scala-xml-parser branch March 15, 2026 03:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants