Apache Solr and Apache Tika: slow document indexing

My server has 4 GB of RAM, and Solr is running with 3 GB of memory.

I am extracting text and metadata using the Apache Tika server (tika-server.jar).

Extraction is taking much longer than usual: a 20 MB file takes 2 to 3 minutes.

My server is hosted on Amazon cloud and runs Ubuntu 14.04.

I have tested this on my local machine, where the same file is extracted in 1 to 2 seconds.

Is there a special configuration needed for an Amazon cloud instance? My local machine also has 4 GB of RAM, but it runs Mac OS.

I am using tika-python to index my documents.
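
A minimal sketch of how per-file extraction time could be measured with tika-python, so the slow files on the cloud instance can be compared against the local machine. It assumes tika-server is already running and listening on its default port 9998; `tika_extract` and `time_extraction` are hypothetical helper names, not part of tika-python.

```python
import time
from typing import Callable, Iterable, List, Tuple


def tika_extract(path: str, endpoint: str = "http://localhost:9998") -> dict:
    """Send one file to a running tika-server through tika-python.

    The endpoint is an assumption (tika-server's default port).
    """
    # Imported lazily so the timing helper below also works with other extractors.
    from tika import parser
    return parser.from_file(path, serverEndpoint=endpoint)


def time_extraction(
    paths: Iterable[str],
    extract: Callable[[str], object] = tika_extract,
) -> List[Tuple[str, float]]:
    """Return (path, seconds) pairs for each file, slowest first."""
    timings = []
    for path in paths:
        start = time.perf_counter()
        extract(path)  # result is discarded; only the elapsed time matters here
        timings.append((path, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)
```

Running `time_extraction(["sample.pdf", "sample.doc"])` on both machines with the same files (the file names here are placeholders) would show whether every format is slow on the server or only some of them.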

I have around 1 million documents in different file formats (pdf, html, doc, ppt, xml, txt).

Please suggest a remedy or an alternative solution to Apache-Tika.


Source: apache
