|
|
Something Something (mailinglists19@...) 2010-01-25, 23:22
If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class be instantiated only on one machine.. always? I mean if I have a cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to be instantiated only on 1 machine?
If answer is yes, then I will use static variable as a counter to see how may rows have been added to my HBase table so far. In my use case, I want to write only N number of rows to a table. Is there a better way to do this? Please let me know. Thanks.
-
Re: setNumReduceTasks(1)
Jeff Zhang (zjffdu@...) 2010-01-26, 00:09
*See my comments below*
On Mon, Jan 25, 2010 at 3:22 PM, Something Something < mailinglists19@gmail.com> wrote:
> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class > be instantiated only on one machine.. always? I mean if I have a cluster > of > say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to > be instantiated only on 1 machine? > > *--Yes* > If answer is yes, then I will use static variable as a counter to see how > may rows have been added to my HBase table so far. In my use case, I want > to write only N number of rows to a table. Is there a better way to do > this? Please let me know. Thanks. >
*--Maybe you can use Counter to track the number of rows you add to HBase, then you do not need to limit the reduce task as 1* -- Best Regards
Jeff Zhang
-
Re: setNumReduceTasks(1)
Mridul Muralidharan (mridulm@...) 2010-01-26, 07:08
Jeff Zhang wrote: > *See my comments below* > > On Mon, Jan 25, 2010 at 3:22 PM, Something Something < > mailinglists19@gmail.com> wrote: > >> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class >> be instantiated only on one machine.. always? I mean if I have a cluster >> of >> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to >> be instantiated only on 1 machine? >> >> *--Yes* > > >> If answer is yes, then I will use static variable as a counter to see how >> may rows have been added to my HBase table so far. In my use case, I want >> to write only N number of rows to a table. Is there a better way to do >> this? Please let me know. Thanks. >> > > *--Maybe you can use Counter to track the number of rows you add to HBase, > then you do not need to limit the reduce task as 1* > >
Counter's are not synchronized in 'real-time' : so you cant use that to limit at addition time imo. It is more for aggregation, not realtime messaging.
- Mridul
-
Re: setNumReduceTasks(1)
Alex Baranov (alex.baranov.v@...) 2010-01-28, 06:04
Since MapReduce programming model defines only one "communication" point between jobs - the one that occurs after all Map tasks are done and before Reduce tasks begin I believe that anyway the solution to your problem will come at a price of lower performance.
Although I don't think that while having 10 workers you should use "setNumReduceTasks(1)" tactics since the performance will be very degraded. Of course this depends on N number a lot: if N is quite small and everything you're going to output from your reduce task is going to lay down in one datanode then *may be* your strategy can be considered.
If N is really very big and there is lot of work to do before Reducers should stop, then I'd consider communication throught storing the info about a progress in DFS (implementation will not be straightforward though, since we don't want to affect performance a lot).
Alex.
On Tue, Jan 26, 2010 at 1:22 AM, Something Something < mailinglists19@gmail.com> wrote:
> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class > be instantiated only on one machine.. always? I mean if I have a cluster > of > say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to > be instantiated only on 1 machine? > > If answer is yes, then I will use static variable as a counter to see how > may rows have been added to my HBase table so far. In my use case, I want > to write only N number of rows to a table. Is there a better way to do > this? Please let me know. Thanks. >
-
Re: setNumReduceTasks(1)
Mridul Muralidharan (mridulm@...) 2010-01-28, 07:09
A possible solution is to emit only N rows from each mapper and then use 1 reduce task [*] - if value of N is not very high. So you end up with utmost m * N rows on reducer instead of full inputset - and so the limit can be done easier. If you ok with some sort of variance in the number of rows inserted (and if value of N is very high), you can do more interesting things like N/m' rows per mapper - and multiple reducers (r) : with assumtion that each reducer will see atleast N/r rows - and so you can limit to N/r per reducer : ofcourse, there is a possible error that gets introduced here ... Regards, Mridul
[*] Assuming you just want simple limit - nothing else. Also note, each mapper might want to emit N rows instead of 'tweaks' like N/m rows, since it is possible that multiple mappers might have less than N/m rows to emit to begin with ! Something Something wrote: > If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class > be instantiated only on one machine.. always? I mean if I have a cluster of > say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to > be instantiated only on 1 machine? > > If answer is yes, then I will use static variable as a counter to see how > may rows have been added to my HBase table so far. In my use case, I want > to write only N number of rows to a table. Is there a better way to do > this? Please let me know. Thanks.
-
Re: setNumReduceTasks(1)
Something Something (mailinglists19@...) 2010-01-29, 17:36
I am sorry, but I forgot to add one important piece of information.
I don't want to write any random N rows to the table. I want to write the *top* N rows - meaning - I want to write the "key" values of the Reducer in descending order. Does this make sense? Sorry for the confusion.
On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <mridulm@yahoo-inc.com > wrote:
> > A possible solution is to emit only N rows from each mapper and then use 1 > reduce task [*] - if value of N is not very high. > So you end up with utmost m * N rows on reducer instead of full inputset - > and so the limit can be done easier. > > > If you ok with some sort of variance in the number of rows inserted (and if > value of N is very high), you can do more interesting things like N/m' rows > per mapper - and multiple reducers (r) : with assumtion that each reducer > will see atleast N/r rows - and so you can limit to N/r per reducer : > ofcourse, there is a possible error that gets introduced here ... > > > Regards, > Mridul > > [*] Assuming you just want simple limit - nothing else. > Also note, each mapper might want to emit N rows instead of 'tweaks' like > N/m rows, since it is possible that multiple mappers might have less than > N/m rows to emit to begin with ! > > > > Something Something wrote: > >> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the >> class >> be instantiated only on one machine.. always? I mean if I have a cluster >> of >> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed >> to >> be instantiated only on 1 machine? >> >> If answer is yes, then I will use static variable as a counter to see how >> may rows have been added to my HBase table so far. In my use case, I want >> to write only N number of rows to a table. Is there a better way to do >> this? Please let me know. Thanks. >> > >
|