Having two sets of input combined on Hadoop
I have a rather simple Hadoop question that I'll try to present with an example.
Say I have a list of strings and a large file, and I want each mapper to process a piece of the file together with one of the strings, in a grep-like program.
How am I supposed to do that? I am under the impression that the number of mappers is a result of the InputSplits produced. I could run subsequent jobs, one for each string, but that seems kinda... messy?
Edit: I am not trying to build a map-reduce version of grep. I used it as an example of having two different inputs to a mapper. Let's say I have lists A and B and want a mapper to work on one element of list A and one element of list B.
So, given that the problem has no data dependency that would require chaining jobs, is my only option to somehow share all of list A with every mapper and feed one element of list B to each mapper?
What I am trying to build is a type of prefixed look-up structure for my data. I have a giant text and a set of strings. The process has a strong memory bottleneck, which is why I am after one chunk of text / one string per mapper.
Mappers should be able to work independently and without side effects. One form of parallelism: each mapper tries to match a line against all the patterns, so each input line is processed only once.
Otherwise you could multiply each input line by the number of patterns, process each line with a single pattern, and run a reducer afterwards. A ChainMapper would be the solution of choice there. But remember: a line will appear twice if it matches two patterns. Is that what you want?
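As a rough sketch of that alternative (my reading of it, not code from the thread): a fan-out mapper emits every line once per pattern, and a reducer then applies its single pattern. It uses a plain mapper + reducer rather than a ChainMapper, and the hard-coded pattern list, the substring match, and all class names are assumptions for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PerPatternGrep {

    // Fans each input line out once per pattern, keyed by the pattern.
    public static class FanOutMapper extends Mapper<LongWritable, Text, Text, Text> {
        // Hard-coded here for the sketch; in a real job the patterns would come
        // from the job configuration or the distributed cache.
        private final List<String> patterns = Arrays.asList("foo", "bar");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String pattern : patterns) {
                context.write(new Text(pattern), line);
            }
        }
    }

    // Each reduce call sees one pattern and the lines fanned out to it;
    // a line that matches two patterns is emitted under both keys.
    public static class SinglePatternReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text pattern, Iterable<Text> lines, Context context)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                if (line.toString().contains(pattern.toString())) {
                    context.write(pattern, line);
                }
            }
        }
    }
}
```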
In my opinion you should prefer the first scenario: each mapper processes a line independently and checks it against all known patterns.
Hint: you can distribute the patterns to the mappers with the DistributedCache feature! ;-) The input should be split with the InputLineFormat.
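A minimal sketch of that preferred scenario, assuming the patterns are shipped as a plain text file via the distributed cache; the file name patterns.txt, the substring match, and the output key/value types are my assumptions, not part of the answer.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<String> patterns = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the driver added the pattern file with a "#patterns.txt" fragment,
        // e.g. job.addCacheFile(new URI("/user/me/patterns.txt#patterns.txt")),
        // so it is symlinked into the task's working directory under that name.
        try (BufferedReader reader = new BufferedReader(new FileReader("patterns.txt"))) {
            String pattern;
            while ((pattern = reader.readLine()) != null) {
                patterns.add(pattern.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Every line of the split is read exactly once and checked against all patterns.
        String text = line.toString();
        for (String pattern : patterns) {
            if (text.contains(pattern)) {
                context.write(new Text(pattern), line);
            }
        }
    }
}
```

With this layout each task only ever holds the pattern list plus the current line in memory, which fits the memory constraint described in the question.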