Replace scala-xml internal API usage with public APIs#2050
Replace scala-xml internal API usage with public APIs#2050farmdawgnation merged 6 commits intomainfrom
Conversation
Adds comprehensive characterization tests for Html5Parser, Html5Writer, PCDataXmlParser, AltXML, and MarkdownParser to lock in current behavior before refactoring the scala-xml internal API dependencies. New test coverage includes: AutoInsertedBody unwrapping, void/non-void tag rendering, CDATA round-trips as Lift PCData nodes, HTML entity resolution, namespace prefix handling, script/style verbatim content, unicode, malformed input failure cases, and full parse+serialize round-trips for both parser paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tml5Parser Use the public scala-xml API (XML.withXMLReader) instead of the unsupported NoBindingFactoryAdapter internal class. The nu.validator HtmlParser already implements XMLReader, so it integrates directly with XMLLoader.load(). Removes all scala.xml.parsing._ imports and internal field access from HtmlParser.scala. AutoInsertedBody stripping logic is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mplementation Replace the PCDataMarkupParser trait and PCDataXmlParser class (which extended ConstructingHandler, MarkupParser, and ExternalSources from scala.xml.parsing) with a SAX-based LiftSaxTreeBuilder that uses javax.xml.parsers.SAXParserFactory. Key design decisions: - namespace-aware=false to allow undeclared namespace prefixes (e.g. lift:) and surface xmlns:* as regular attributes for manual NamespaceBinding construction - LexicalHandler integration detects CDATA sections and produces Lift PCData nodes - Named HTML entities (e.g. ) preprocessed to numeric char refs before SAX parsing since a standard XML parser has no DTD with entity declarations - CDATA sections skipped during entity preprocessing to preserve their content - Public API (PCDataXmlParser.apply(String/InputStream) -> Box[NodeSeq]) unchanged - Preamble stripping logic (XML declaration + DOCTYPE) unchanged Removes all scala.xml.parsing.* imports from this file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused import scala.xml.parsing.XhtmlParser from BlockParsers.scala - Remove HtmlEntities.entities and apply() which used ParsedEntityDecl/IntDef from scala.xml.dtd (these were only used by the now-removed PCDataXmlParser class) - Remove import scala.xml.dtd._ from PCDataMarkupParser.scala Completes the removal of all scala.xml internal API usage from lift-util. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ures In specs2 5.x (used for Scala 3), XmlMatchers moved from specs2-matcher-extra to a separate specs2-xml artifact. Eight test files in scala-3/ imported org.specs2.matcher.XmlMatchers, causing compilation failures for all Scala 3.3.7 CI jobs. Also fix a type inference issue in HtmlRoundTripSpec where `must not contain` was ambiguous under Scala 3 + specs2 5.x; replaced with `must not(contain(...))`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates Lift’s XML/HTML parsing and serialization test coverage, and reworks PCDataXmlParser to use a SAX-based implementation (with added build support for Specs2 XML matchers on Scala 3).
Changes:
- Replaced the
PCDataXmlParserimplementation to parse via SAX and added preprocessing for named HTML entities and CDATA handling. - Updated
Html5Parserto parse viascala.xml.XML.withXMLReaderand expanded characterization/round-trip test coverage for HTML/Markdown/XML utilities. - Added conditional Specs2 XML dependency for Scala 3 test builds.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| project/Dependencies.scala | Adds specs2XmlDeps helper to include Specs2 XML module on Scala 3. |
| build.sbt | Wires Scala-version-dependent Specs2 XML test dependency into the build. |
| core/util/src/main/scala/net/liftweb/util/PCDataMarkupParser.scala | Replaces MarkupParser-based PCData XML parsing with a SAX-based parser and entity preprocessing. |
| core/util/src/main/scala/net/liftweb/util/HtmlParser.scala | Switches HTML5 parsing to XML.withXMLReader(hp).load and retains AutoInsertedBody unwrapping. |
| core/markdown/src/main/scala/net/liftweb/markdown/BlockParsers.scala | Removes unused XhtmlParser import. |
| core/util/src/test/scala-3/net/liftweb/util/XmlParserSpec.scala | Adds AltXML behavior tests (Scala 3). |
| core/util/src/test/scala-3/net/liftweb/util/PCDataXmlParserSpec.scala | Expands PCDataXmlParser characterization tests (Scala 3). |
| core/util/src/test/scala-3/net/liftweb/util/MarkdownParserSpec.scala | Adds characterization tests for MarkdownParser (Scala 3). |
| core/util/src/test/scala-3/net/liftweb/util/HtmlRoundTripSpec.scala | Adds round-trip tests for Html5 and PCDataXmlParser+AltXML (Scala 3). |
| core/util/src/test/scala-3/net/liftweb/util/Html5ParserSpec.scala | Expands Html5Parser/Writer test coverage (Scala 3). |
| core/util/src/test/scala-2.13/net/liftweb/util/XmlParserSpec.scala | Adds AltXML behavior tests (Scala 2.13). |
| core/util/src/test/scala-2.13/net/liftweb/util/PCDataXmlParserSpec.scala | Expands PCDataXmlParser characterization tests (Scala 2.13). |
| core/util/src/test/scala-2.13/net/liftweb/util/MarkdownParserSpec.scala | Adds characterization tests for MarkdownParser (Scala 2.13). |
| core/util/src/test/scala-2.13/net/liftweb/util/HtmlRoundTripSpec.scala | Adds round-trip tests for Html5 and PCDataXmlParser+AltXML (Scala 2.13). |
| core/util/src/test/scala-2.13/net/liftweb/util/Html5ParserSpec.scala | Expands Html5Parser/Writer test coverage (Scala 2.13). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| def apply(in: String): Box[NodeSeq] = { | ||
| var pos = 0 | ||
| val len = in.length | ||
| def moveToLT(): Unit = | ||
| while (pos < len && in.charAt(pos) != '<') pos += 1 | ||
|
|
||
| if ((ch != '/') && (ch != '>') && ('?' != ch)) | ||
| xSpace(); | ||
| } | ||
| moveToLT() | ||
|
|
||
| def findIt(base: MetaData, what: MetaData): MetaData = (base, what) match { | ||
| case (_, Null) => Null | ||
| case (upb: UnprefixedAttribute, upn: UnprefixedAttribute) if upb.key == upn.key => upn | ||
| case (pb: PrefixedAttribute, pn: PrefixedAttribute) if pb.key == pn.key && pb.pre == pn.key => pn | ||
| case _ => findIt(base, what.next) | ||
| // scan past <? ... ?> | ||
| if (pos + 1 < len && in.charAt(pos + 1) == '?') { | ||
| pos += 1 | ||
| moveToLT() | ||
| } | ||
|
|
||
| if(!aMap.wellformed(scope)) { | ||
| if (findIt(aMap, aMap.next) != Null) { | ||
| reportSyntaxError( "double attribute") | ||
| } | ||
| // scan past <!DOCTYPE ....> | ||
| if (pos + 1 < len && in.charAt(pos + 1) == '!') { | ||
| pos += 1 | ||
| moveToLT() | ||
| } | ||
|
|
||
| (aMap,scope) | ||
| parseXml(preprocessEntities(in.substring(pos))) |
There was a problem hiding this comment.
Some notes from Opus:
The logic does naively match any <! sequence and skip forward to the next <. However, Copilot overstates the practical risk:
Comments before the root element in well-formed XML are unusual in Lift’s use case (HTML templates).
The old MarkupParser-based code didn’t handle this either — it had a similar heuristic approach.
That said, the suggestion to explicitly match and >) is reasonable and low-effort. It would make the code more robust without much complexity.
We're already changing a lot here. I think this is more reasonable as a future improvement.
| private def parseXml(content: String): Box[NodeSeq] = | ||
| tryo { | ||
| // namespace-aware=false: undeclared prefixes (e.g. lift:) are allowed; xmlns | ||
| // attributes are surfaced as regular attributes so we can build NamespaceBinding | ||
| // manually, matching the behaviour of the previous MarkupParser-based impl. | ||
| val spf = SAXParserFactory.newInstance() | ||
| spf.setNamespaceAware(false) | ||
| val saxParser = spf.newSAXParser() | ||
| val xmlReader = saxParser.getXMLReader | ||
| val builder = new LiftSaxTreeBuilder | ||
| xmlReader.setContentHandler(builder) | ||
| xmlReader.setErrorHandler(builder) | ||
| xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler", builder) | ||
| xmlReader.parse(new InputSource(new StringReader(content))) | ||
| NodeSeq.fromSeq(Seq(builder.result)) |
There was a problem hiding this comment.
It's not clear to me what situation would result in untrusted entities running through this code path so I'm inclined to leave this alone for the moment.
Wrap parsing in try/finally so the InputStream is always closed, including when XML.withXMLReader(...).load() throws an exception. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
I've spun up seventhings with this branch and it seems to be working. Shipping this. |
Summary
Resolves GitHub issue #2032: removes all internal scala-xml dependency from lift-webkit's document parsing pipeline.
NoBindingFactoryAdapterwith publicXML.withXMLReader()APIMarkupParser/ConstructingHandlerwith SAX-basedLiftSaxTreeBuilderscala.xml.parsing.*andscala.xml.dtd.*importsNo breaking changes to public APIs. All 501 util tests pass.
Testing