IMPORTANT!!!

This warc-indexer repository has been extracted from the multi-module Maven project webarchive-discovery, maintained by the British Library, and moved to netarchivesuite, maintained by the Royal Danish Library. The Jira issues for warc-indexer have not been moved yet and can be found here: https://github.com/ukwa/webarchive-discovery/issues

WARC Indexer

Latest release 3.4.0: https://github.com/netarchivesuite/warc-indexer/releases/tag/3.4.0

This code runs Apache Tika on WARC and ARC records and extracts suitable metadata for indexing.

It is set up to work with Apache Solr, and our schema is provided in src/main/solr. The tests can spin up an embedded Solr instance to verify the configuration and regression-test the indexer at the query level.

The following command builds the project and also produces a command-line tool for generating and posting Solr records from web archive files:

$ mvn clean install

The resulting tool runs like this:

$ java -jar target/warc-indexer-1.1.1-SNAPSHOT-jar-with-dependencies.jar \
-s http://localhost:8080/ \
src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz
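
To index several files in one go, the invocation above can be wrapped in a small shell loop. The following is only a sketch under stated assumptions: the helper names `find_indexer_jar` and `index_all` and the `INDEXER_JAVA` variable are hypothetical and not part of this project, the built jar is assumed to match `warc-indexer-*-jar-with-dependencies.jar`, and the indexer is called once per file with exactly the options shown above.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of warc-indexer.

# Print the first jar in directory $1 that matches the assumed naming pattern.
find_indexer_jar() {
  for jar in "$1"/warc-indexer-*-jar-with-dependencies.jar; do
    [ -f "$jar" ] && { printf '%s\n' "$jar"; return 0; }
  done
  return 1
}

# Run the indexer once per file, stopping at the first failure.
# Set INDEXER_JAVA="echo java" to print the commands instead of running them.
index_all() {
  jar="$1"; solr_url="$2"; shift 2
  for warc in "$@"; do
    ${INDEXER_JAVA:-java} -jar "$jar" -s "$solr_url" "$warc" || return 1
  done
}

# Example (not executed here):
#   jar=$(find_indexer_jar target) && index_all "$jar" http://localhost:8080/ *.warc.gz
```

Looking the jar up by pattern avoids hard-coding the version string in the file name, which changes between builds.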

TBA configuration HOW TO.

To print the default configuration:

$ java -cp target/warc-indexer-1.1.1-SNAPSHOT-jar-with-dependencies.jar uk.bl.wa.util.ConfigPrinter

To override the default with a new configuration:

$ java -Dconfig.file=new.conf -jar target/warc-indexer-1.1.1-SNAPSHOT-jar-with-dependencies.jar \
-s http://localhost:8080/ \
src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz
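
One possible workflow is to save the printed default configuration to a file, edit it, and pass the edited file back to the indexer. The sketch below assumes that the output of `ConfigPrinter` is itself usable as a configuration file, which should be checked before relying on it. The function names and the `INDEXER_JAVA` variable are hypothetical helpers, not part of this project.

```shell
#!/bin/sh
# Sketch only: function names and INDEXER_JAVA are hypothetical helpers.

# Write the printed default configuration to file $2, using the jar $1.
dump_default_config() {
  jar="$1"; out="$2"
  ${INDEXER_JAVA:-java} -cp "$jar" uk.bl.wa.util.ConfigPrinter > "$out"
}

# Index one file with a custom configuration file.
# The -D option is placed before -jar so that the JVM reads it as a
# system property rather than passing it to the program as an argument.
index_with_config() {
  jar="$1"; conf="$2"; solr_url="$3"; warc="$4"
  ${INDEXER_JAVA:-java} -Dconfig.file="$conf" -jar "$jar" -s "$solr_url" "$warc"
}

# Example (not executed here):
#   dump_default_config target/warc-indexer-1.1.1-SNAPSHOT-jar-with-dependencies.jar new.conf
#   index_with_config target/warc-indexer-1.1.1-SNAPSHOT-jar-with-dependencies.jar new.conf \
#     http://localhost:8080/ some-file.warc.gz
```

Setting `INDEXER_JAVA="echo java"` prints the commands instead of running them, which is a cheap way to check the arguments first.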

Note that this project also contains short ARC and WARC test files, taken from the warc-test-corpus.

Annotations

Things to document:

Annotations format.
ACT client: uk.bl.wa.annotation.AnnotationsFromAct.main(String[]) > annotations.json
WARCIndexer, CLI and Hadoop versions.
Updater version: uk.bl.wa.annotation.Annotator.main(String[])

License

GNU General Public License Version 2
