map - Having two sets of input combined on Hadoop -


I have a rather simple Hadoop question that I'll try to present with an example.

Say I have a list of strings and a large file, and I want each mapper to process a piece of the file together with one of the strings, in a grep-like program.

How am I supposed to do that? I am under the impression that the number of mappers is a result of the InputSplits produced. Running subsequent jobs, one for each string, seems kind of... messy?

Edit: I am not trying to build a map-reduce version of grep. I used it as an example of having two different inputs to a mapper. Let's say I have lists A and B, and I want a mapper to work on one element of list A and one element of list B.

So, given that the problem has no data dependency that would result in a need for chaining jobs, is my only option to somehow share all of list A with every mapper and feed one element of list B to each mapper as input?

What I am trying to build is a type of prefixed look-up structure over the data. I have a giant text and a set of strings. The process has a strong memory bottleneck, therefore I am after one chunk of text / one string per mapper.

Mappers should be able to work independently and without side effects. To keep the parallelism as high as it can be, each mapper should try to match a line against all of the patterns. That way each input line is processed only once!

Otherwise you would multiply each input line by the number of patterns: process each line with a single pattern, and run a reducer afterwards. A ChainMapper would be the solution of choice there. But remember: a line will appear twice if it matches two patterns. Is that what you want?

In my opinion you should prefer the first scenario: each mapper processes a line independently and checks it against all known patterns.

Hint: you can distribute the patterns to all mappers with the DistributedCache feature! ;-) The input should be split with an InputLineFormat.
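
To make the first scenario concrete, here is a minimal sketch assuming the Hadoop 2.x mapreduce API: the patterns file is shipped to every mapper through the distributed cache (added with a "#patterns" fragment so it shows up as a local symlink), each mapper compiles the patterns once in setup(), and map() checks every input line against all of them in a single pass. The class name MultiPatternGrep, the argument layout (input dir, output dir, patterns file) and the file names are illustrative, not part of the original question.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiPatternGrep {

        public static class GrepMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            private final List<Pattern> patterns = new ArrayList<Pattern>();

            @Override
            protected void setup(Context context) throws IOException {
                // The cached file is symlinked into the task's working directory
                // under the name given after '#', so it reads like a local file.
                BufferedReader reader = new BufferedReader(new FileReader("patterns"));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (!line.trim().isEmpty()) {
                            patterns.add(Pattern.compile(line.trim()));
                        }
                    }
                } finally {
                    reader.close();
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                // Each input line is read exactly once and checked against every pattern.
                for (Pattern p : patterns) {
                    if (p.matcher(line).find()) {
                        context.write(new Text(p.pattern()), value);
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "multi-pattern grep");
            job.setJarByClass(MultiPatternGrep.class);
            job.setMapperClass(GrepMapper.class);
            job.setNumReduceTasks(0);               // map-only: just collect the matches
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Ship the (small) pattern list to every mapper via the distributed cache.
            job.addCacheFile(new URI(args[2] + "#patterns"));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

This keeps the mappers side-effect free: the large text is split normally into InputSplits, while the set of strings rides along as a small read-only file, so no job chaining is needed.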

