Replace scala-xml internal API usage with public APIs by farmdawgnation · Pull Request #2050 · lift/framework

farmdawgnation · 2026-02-26T03:22:12Z

Summary

Resolves GitHub issue #2032: removes all internal scala-xml dependency from lift-webkit's document parsing pipeline.

Html5Parser: Replaced NoBindingFactoryAdapter with public XML.withXMLReader() API
PCDataXmlParser: Replaced MarkupParser/ConstructingHandler with SAX-based LiftSaxTreeBuilder
Removed all scala.xml.parsing.* and scala.xml.dtd.* imports
Added comprehensive characterization tests (80+ new tests) to lock in parsing behavior

No breaking changes to public APIs. All 501 util tests pass.

Testing

501 tests passing, 0 failures, 3 pre-existing pending
HTML5 parsing: entity handling, namespace prefixes, CDATA preservation
XHTML parsing: named entity preprocessing, lift: prefix support, round-trip fidelity

Adds comprehensive characterization tests for Html5Parser, Html5Writer, PCDataXmlParser, AltXML, and MarkdownParser to lock in current behavior before refactoring the scala-xml internal API dependencies. New test coverage includes: AutoInsertedBody unwrapping, void/non-void tag rendering, CDATA round-trips as Lift PCData nodes, HTML entity resolution, namespace prefix handling, script/style verbatim content, unicode, malformed input failure cases, and full parse+serialize round-trips for both parser paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tml5Parser Use the public scala-xml API (XML.withXMLReader) instead of the unsupported NoBindingFactoryAdapter internal class. The nu.validator HtmlParser already implements XMLReader, so it integrates directly with XMLLoader.load(). Removes all scala.xml.parsing._ imports and internal field access from HtmlParser.scala. AutoInsertedBody stripping logic is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…mplementation Replace the PCDataMarkupParser trait and PCDataXmlParser class (which extended ConstructingHandler, MarkupParser, and ExternalSources from scala.xml.parsing) with a SAX-based LiftSaxTreeBuilder that uses javax.xml.parsers.SAXParserFactory. Key design decisions: - namespace-aware=false to allow undeclared namespace prefixes (e.g. lift:) and surface xmlns:* as regular attributes for manual NamespaceBinding construction - LexicalHandler integration detects CDATA sections and produces Lift PCData nodes - Named HTML entities (e.g.  ) preprocessed to numeric char refs before SAX parsing since a standard XML parser has no DTD with entity declarations - CDATA sections skipped during entity preprocessing to preserve their content - Public API (PCDataXmlParser.apply(String/InputStream) -> Box[NodeSeq]) unchanged - Preamble stripping logic (XML declaration + DOCTYPE) unchanged Removes all scala.xml.parsing.* imports from this file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove unused import scala.xml.parsing.XhtmlParser from BlockParsers.scala - Remove HtmlEntities.entities and apply() which used ParsedEntityDecl/IntDef from scala.xml.dtd (these were only used by the now-removed PCDataXmlParser class) - Remove import scala.xml.dtd._ from PCDataMarkupParser.scala Completes the removal of all scala.xml internal API usage from lift-util. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ures In specs2 5.x (used for Scala 3), XmlMatchers moved from specs2-matcher-extra to a separate specs2-xml artifact. Eight test files in scala-3/ imported org.specs2.matcher.XmlMatchers, causing compilation failures for all Scala 3.3.7 CI jobs. Also fix a type inference issue in HtmlRoundTripSpec where `must not contain` was ambiguous under Scala 3 + specs2 5.x; replaced with `must not(contain(...))`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates Lift’s XML/HTML parsing and serialization test coverage, and reworks PCDataXmlParser to use a SAX-based implementation (with added build support for Specs2 XML matchers on Scala 3).

Changes:

Replaced the PCDataXmlParser implementation to parse via SAX and added preprocessing for named HTML entities and CDATA handling.
Updated Html5Parser to parse via scala.xml.XML.withXMLReader and expanded characterization/round-trip test coverage for HTML/Markdown/XML utilities.
Added conditional Specs2 XML dependency for Scala 3 test builds.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
project/Dependencies.scala	Adds `specs2XmlDeps` helper to include Specs2 XML module on Scala 3.
build.sbt	Wires Scala-version-dependent Specs2 XML test dependency into the build.
core/util/src/main/scala/net/liftweb/util/PCDataMarkupParser.scala	Replaces MarkupParser-based PCData XML parsing with a SAX-based parser and entity preprocessing.
core/util/src/main/scala/net/liftweb/util/HtmlParser.scala	Switches HTML5 parsing to `XML.withXMLReader(hp).load` and retains AutoInsertedBody unwrapping.
core/markdown/src/main/scala/net/liftweb/markdown/BlockParsers.scala	Removes unused `XhtmlParser` import.
core/util/src/test/scala-3/net/liftweb/util/XmlParserSpec.scala	Adds AltXML behavior tests (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/PCDataXmlParserSpec.scala	Expands PCDataXmlParser characterization tests (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/MarkdownParserSpec.scala	Adds characterization tests for MarkdownParser (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/HtmlRoundTripSpec.scala	Adds round-trip tests for Html5 and PCDataXmlParser+AltXML (Scala 3).
core/util/src/test/scala-3/net/liftweb/util/Html5ParserSpec.scala	Expands Html5Parser/Writer test coverage (Scala 3).
core/util/src/test/scala-2.13/net/liftweb/util/XmlParserSpec.scala	Adds AltXML behavior tests (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/PCDataXmlParserSpec.scala	Expands PCDataXmlParser characterization tests (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/MarkdownParserSpec.scala	Adds characterization tests for MarkdownParser (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/HtmlRoundTripSpec.scala	Adds round-trip tests for Html5 and PCDataXmlParser+AltXML (Scala 2.13).
core/util/src/test/scala-2.13/net/liftweb/util/Html5ParserSpec.scala	Expands Html5Parser/Writer test coverage (Scala 2.13).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

farmdawgnation · 2026-03-15T03:45:26Z

core/util/src/main/scala/net/liftweb/util/PCDataMarkupParser.scala

+  def apply(in: String): Box[NodeSeq] = {
+    var pos = 0
+    val len = in.length
+    def moveToLT(): Unit =
+      while (pos < len && in.charAt(pos) != '<') pos += 1

-      if ((ch != '/') && (ch != '>') && ('?' != ch))
-      xSpace();
-    }
+    moveToLT()

-    def findIt(base: MetaData, what: MetaData): MetaData = (base, what) match {
-      case (_, Null) => Null
-      case (upb: UnprefixedAttribute, upn: UnprefixedAttribute) if upb.key == upn.key => upn
-      case (pb: PrefixedAttribute, pn: PrefixedAttribute) if pb.key == pn.key && pb.pre == pn.key => pn
-      case _ => findIt(base, what.next)
+    // scan past <? ... ?>
+    if (pos + 1 < len && in.charAt(pos + 1) == '?') {
+      pos += 1
+      moveToLT()
    }

-    if(!aMap.wellformed(scope)) {
-      if (findIt(aMap, aMap.next) != Null) {
-        reportSyntaxError( "double attribute")
-      }
+    // scan past <!DOCTYPE ....>
+    if (pos + 1 < len && in.charAt(pos + 1) == '!') {
+      pos += 1
+      moveToLT()
    }

-    (aMap,scope)
+    parseXml(preprocessEntities(in.substring(pos)))


Some notes from Opus:

The logic does naively match any <! sequence and skip forward to the next <. However, Copilot overstates the practical risk:

Comments before the root element in well-formed XML are unusual in Lift’s use case (HTML templates).
The old MarkupParser-based code didn’t handle this either — it had a similar heuristic approach.
That said, the suggestion to explicitly match and >) is reasonable and low-effort. It would make the code more robust without much complexity.

We're already changing a lot here. I think this is more reasonable as a future improvement.

farmdawgnation · 2026-03-15T03:44:17Z

core/util/src/main/scala/net/liftweb/util/PCDataMarkupParser.scala

+  private def parseXml(content: String): Box[NodeSeq] =
+    tryo {
+      // namespace-aware=false: undeclared prefixes (e.g. lift:) are allowed; xmlns
+      // attributes are surfaced as regular attributes so we can build NamespaceBinding
+      // manually, matching the behaviour of the previous MarkupParser-based impl.
+      val spf = SAXParserFactory.newInstance()
+      spf.setNamespaceAware(false)
+      val saxParser = spf.newSAXParser()
+      val xmlReader = saxParser.getXMLReader
+      val builder   = new LiftSaxTreeBuilder
+      xmlReader.setContentHandler(builder)
+      xmlReader.setErrorHandler(builder)
+      xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler", builder)
+      xmlReader.parse(new InputSource(new StringReader(content)))
+      NodeSeq.fromSeq(Seq(builder.result))


It's not clear to me what situation would result in untrusted entities running through this code path so I'm inclined to leave this alone for the moment.

core/util/src/main/scala/net/liftweb/util/HtmlParser.scala

Wrap parsing in try/finally so the InputStream is always closed, including when XML.withXMLReader(...).load() throws an exception. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

farmdawgnation · 2026-03-15T03:46:16Z

I've spun up seventhings with this branch and it seems to be working. Shipping this.

farmdawgnation and others added 5 commits February 25, 2026 22:06

farmdawgnation requested a review from Copilot March 15, 2026 03:15

Copilot started reviewing on behalf of farmdawgnation March 15, 2026 03:16 View session

Copilot AI reviewed Mar 15, 2026

View reviewed changes

fix: close InputStream in Html5Parser.parse on failure

77e23d1

Wrap parsing in try/finally so the InputStream is always closed, including when XML.withXMLReader(...).load() throws an exception. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

farmdawgnation merged commit a286926 into main Mar 15, 2026
6 checks passed

farmdawgnation deleted the msf/replace-scala-xml-parser branch March 15, 2026 03:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace scala-xml internal API usage with public APIs#2050

Replace scala-xml internal API usage with public APIs#2050
farmdawgnation merged 6 commits intomainfrom
msf/replace-scala-xml-parser

farmdawgnation commented Feb 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

farmdawgnation Mar 15, 2026

Uh oh!

farmdawgnation Mar 15, 2026

Uh oh!

Uh oh!

farmdawgnation commented Mar 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

farmdawgnation commented Feb 26, 2026

Summary

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

farmdawgnation Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

farmdawgnation Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

farmdawgnation commented Mar 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants