python - Scraping data from multiple websites, merging the data and indexing in Elasticsearch


I'm using Scrapy to scrape data on products (product name, manufacturer) from a website. I'm using a pipeline (http://github.com/noplay/scrapy-elasticsearch) to index the data directly into the Elasticsearch search engine. I'd also like to scrape data from another site (either using an API or Scrapy again) that provides data on manufacturers and their reputation (a simple ranking of the top 250 manufacturers, for example). In the Elasticsearch index, an example document might have the following fields:

Product name: ifruit 7 (scraped from site A)
Product manufacturer: pear (scraped from site A and site B)
Manufacturer ranking: 17 (scraped from site B)

What is the simplest way to combine the scraped data in the Elasticsearch index so that each document stores information on the product name, manufacturer, and ranking? Is it best to try to merge the data within the scraping process, or to combine the two JSON files, or to adapt the pipeline, or to mess around with the data once it has been indexed in Elasticsearch? Or is there a better solution?
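To make the first option concrete, merging within the scraping process could look roughly like an extra item pipeline that runs before the Elasticsearch pipeline and attaches the ranking to each product. This is only a sketch under assumptions: the class name `ManufacturerRankingPipeline`, the field names, and the in-memory `rankings` dict (standing in for the data from site B) are hypothetical and not part of scrapy-elasticsearch.

```python
class ManufacturerRankingPipeline:
    """Hypothetical Scrapy item pipeline that adds a ranking to each product item."""

    def __init__(self, rankings=None):
        # Assumed lookup table built from site B: manufacturer name -> ranking.
        self.rankings = rankings or {}

    def process_item(self, item, spider):
        # Scrapy calls process_item(item, spider) once for every scraped item.
        manufacturer = item.get("manufacturer")
        # Stays None when site B has no ranking for this manufacturer.
        item["manufacturer_ranking"] = self.rankings.get(manufacturer)
        return item


# Stand-alone usage with plain dicts, so it can be tried without Scrapy.
pipeline = ManufacturerRankingPipeline({"pear": 17})
product = {"name": "ifruit 7", "manufacturer": "pear"}
merged = pipeline.process_item(product, spider=None)
```

In a real project this class would presumably be registered in `ITEM_PIPELINES` with a lower order number than the Elasticsearch pipeline, so that the ranking is already on the item when it gets indexed. With a Scrapy `Item` rather than a dict, `manufacturer_ranking` would also need to be declared as a field.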

It's possible that the manufacturer may be spelled or phrased differently in the two data sets as well. How can this issue be overcome?
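One way the spelling problem might be approached is to normalize the names and then fall back to fuzzy matching with the standard library's `difflib`. The sketch below is an illustration only: the helper names `normalize` and `match_manufacturer`, the suffix list, and the `0.8` cutoff are assumptions that would need tuning against real data.

```python
import difflib


def normalize(name):
    # Lowercase, strip punctuation, and drop a few common company suffixes
    # (the suffix list is an assumption, not exhaustive).
    words = name.lower().replace(",", " ").replace(".", " ").split()
    suffixes = {"inc", "ltd", "corp", "co", "llc"}
    return " ".join(w for w in words if w not in suffixes)


def match_manufacturer(name, known_names, cutoff=0.8):
    """Return the closest entry of known_names, or None if nothing is close enough."""
    normalized = {normalize(k): k for k in known_names}
    hits = difflib.get_close_matches(
        normalize(name), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Names as they might appear on site B (hypothetical values).
known = ["pear", "banana computing"]
best = match_manufacturer("Pear Inc.", known)
```

Here `best` resolves to the site B spelling, which can then be used as the lookup key for the ranking. Names that match nothing return `None` and could be logged for manual review rather than guessed.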

