Asynchronous job library that consumes PDF URLs from RabbitMQ and publishes the PDF text back.
rabbitmq_pdfparser is an asynchronous job library that consumes PDF URLs from RabbitMQ and publishes the extracted PDF text back to RabbitMQ. It stops when the queue is empty.
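To illustrate the "pool of consumers that stops when the queue is empty" behaviour, here is a minimal sketch that uses an in-memory asyncio.Queue as a stand-in for RabbitMQ. The names `worker` and `drain`, the fake text extraction, and the shape of the result dictionaries are all assumptions made for illustration; they are not the library's actual implementation or output format.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Take jobs from the queue until it is empty, then return."""
    while True:
        try:
            job = queue.get_nowait()
        except asyncio.QueueEmpty:
            return  # nothing left to do, so this worker stops
        # Stand-in for "download the PDF and extract its text" (hypothetical).
        await asyncio.sleep(0)
        results.append({"id": job["id"], "text": "text of " + job["url"]})


async def drain(jobs: list, pool_size: int = 3) -> list:
    """Run pool_size workers concurrently until every job is processed."""
    queue: asyncio.Queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait(job)
    results: list = []
    await asyncio.gather(*(worker(queue, results) for _ in range(pool_size)))
    return results


if __name__ == "__main__":
    jobs = [
        {"id": "foo", "url": "http://example.com/foo/bar.pdf"},
        {"id": "baz", "url": "http://example.com/baz.pdf"},
    ]
    print(asyncio.run(drain(jobs)))
```

In this sketch each worker exits as soon as the queue reports empty, so the whole job ends without an explicit shutdown signal, which is what makes the pattern convenient for scheduled runs.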
You can install this library easily with pip.
pip install rabbitmq-pdfparser
Messages sent to the source queue must have this format:
{"id": "foo", "url": "http://example.com/foo/bar.pdf"}
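As a sketch of how a producer might build such a message, and how a consumer might check it, the snippet below serialises and validates the two fields shown above. It assumes the message body is UTF-8 encoded JSON; the helper names `make_message` and `parse_message` are hypothetical and not part of rabbitmq_pdfparser.

```python
import json


def make_message(doc_id: str, url: str) -> bytes:
    """Serialise a job as JSON bytes in the {"id": ..., "url": ...} shape."""
    if not url.lower().startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) URL: %r" % url)
    return json.dumps({"id": doc_id, "url": url}).encode("utf-8")


def parse_message(body: bytes) -> dict:
    """Decode a message body and check that both expected keys are present."""
    data = json.loads(body)
    missing = {"id", "url"} - set(data)
    if missing:
        raise ValueError("message is missing keys: %s" % sorted(missing))
    return data


if __name__ == "__main__":
    body = make_message("foo", "http://example.com/foo/bar.pdf")
    print(parse_message(body))
```

The resulting bytes could then be published to the source queue with whichever AMQP client you already use.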
import os
import asyncio
import logging

from rabbitmq_pdfparser import consume

if __name__ == '__main__':
    # Send the library's log records to the console; level and format
    # can be overridden with LOG_LEVEL and LOG_FORMAT.
    logger = logging.getLogger("rabbitmq_pdfparser")
    logger.setLevel(os.environ.get('LOG_LEVEL', "DEBUG"))
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter(
            os.environ.get('LOG_FORMAT', "%(asctime)s [%(levelname)s] %(name)s: %(message)s")
        )
    )
    logger.addHandler(handler)

    # RabbitMQ connection settings, source queue and publish target.
    # MQ_PORT must be set, otherwise int(None) raises a TypeError.
    config = {
        "mq_host": os.environ.get('MQ_HOST'),
        "mq_port": int(os.environ.get('MQ_PORT')),
        "mq_vhost": os.environ.get('MQ_VHOST'),
        "mq_user": os.environ.get('MQ_USER'),
        "mq_pass": os.environ.get('MQ_PASS'),
        "mq_source_queue": os.environ.get('MQ_SOURCE_QUEUE'),
        "mq_target_exchange": os.environ.get('MQ_TARGET_EXCHANGE'),
        "mq_target_routing_key": os.environ.get('MQ_TARGET_ROUTING_KEY')
    }

    # Run the consumers until the source queue is empty.
    loop = asyncio.get_event_loop()
    loop.run_until_complete(
        consume(
            loop=loop,
            consumer_pool_size=10,
            config=config
        )
    )
    loop.close()
This library uses PyPDF2, aio_pika and aiohttp packages.
You can also run this library as a standalone PDF parser job: set the required environment variables and run rabbitmq_pdfparser. This use case fits well when you need to run it as a cron job or a Kubernetes job.
Required environment variables:
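Assuming the standalone command reads the same variables as the example script above (MQ_HOST, MQ_PORT, MQ_VHOST, MQ_USER, MQ_PASS, MQ_SOURCE_QUEUE, MQ_TARGET_EXCHANGE and MQ_TARGET_ROUTING_KEY), a small pre-flight check before starting the job might look like the sketch below. The helper name `missing_vars` is hypothetical, and the variable list is taken from the example script rather than from the command itself.

```python
import os

# Variable names taken from the example script; whether the standalone
# command needs exactly this set is an assumption.
REQUIRED_VARS = (
    "MQ_HOST",
    "MQ_PORT",
    "MQ_VHOST",
    "MQ_USER",
    "MQ_PASS",
    "MQ_SOURCE_QUEUE",
    "MQ_TARGET_EXCHANGE",
    "MQ_TARGET_ROUTING_KEY",
)


def missing_vars(environ=None) -> list:
    """Return the names in REQUIRED_VARS that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All expected variables are set.")
```

Running a check like this first gives a readable error in a cron or Kubernetes job log instead of a failure deep inside the connection setup.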
Example Kubernetes job: see kube.yaml.