
Does the reduce function need to output the key, or can I just output the value?

Question asked by jake on Jun 27, 2013
Latest reply on Jun 28, 2013 by gera
Below is a skeleton of the map and reduce scripts I am about to run on 1TB of data... but before I do, I just want to make sure my logic works for Elastic MapReduce and I am not about to waste a lot of my time haha :D

Basically I want to output my 100 GB CSV so it can be loaded into MySQL. I have the reducer output the values in CSV format but without the key. Do I need to add the key to the output of the reducer? Won't the framework take the key/value pairs the mapper outputs and send all lines with the same key to the reducer together, so that once the reducer outputs the value it is done with that key/value pair?

My other question: I will have 10,000+ files, and I notice the reducer starts before the mapping is complete. Does the reducer re-load data it has already reduced and re-reduce it? If so, the key would need to be stored somewhere for the reducer to re-reduce the line, so I couldn't just output the value. Is this correct, or am I overthinking it?
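
The assumption behind both questions can be modeled locally. The sketch below is a toy stand-in for the shuffle step between the mapper and the reducer, not Hadoop's implementation: `shuffle`, `sample`, and the use of `zlib.crc32` as a partitioner are all made up for illustration. It checks the one property the value-only output depends on, namely that every line with a given key lands in the same reducer's input and sits next to the other lines with that key.

```python
import zlib
from collections import defaultdict


def key_of(line):
    """Return the part of a mapper output line before the first tab."""
    return line.split('\t', 1)[0]


def shuffle(mapper_lines, num_reducers):
    """Group mapper output lines into per-reducer, key-sorted lists.

    crc32 is only a stand-in for whatever partitioner the framework uses.
    """
    partitions = defaultdict(list)
    for line in mapper_lines:
        reducer_id = zlib.crc32(key_of(line).encode('utf-8')) % num_reducers
        partitions[reducer_id].append(line)
    # Each reducer reads its lines sorted by key, so equal keys are adjacent.
    return {rid: sorted(lines, key=key_of) for rid, lines in partitions.items()}


# Hypothetical mapper output in the same "id-genkey<TAB>values" shape.
sample = [
    "7-a\t[1, 0, 1]",
    "3-b\t[0, 0, 1]",
    "7-a\t[1, 0, 1]",
    "9-c\t[1, 1, 0]",
]

for reducer_id, lines in sorted(shuffle(sample, 2).items()):
    for line in lines:
        print('%d\t%s' % (reducer_id, line))
```

In this model the grouping happens entirely on the mapper's output, so nothing downstream of the reducer ever looks at a key again.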

**map.py**

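    import json
    import sys

    # NOTE: `statment` (sic) is built in code omitted from this skeleton;
    # an empty list stands in for it here.
    statment = []

    for lines in sys.stdin:
        try:
            decoded = json.loads(lines)
            # Assumption: `values` is the decoded record; the skeleton never shows its assignment.
            values = decoded
            id = values[0]
            genkey = values[2]
            value = values[4]
            for v in value:
                try:
                    #there is more to this, basically just makes a list (1,1,0,0,1,1,0,0,1) which becomes csv.
                    indexV = statment.index(str(v))
                    values[indexV] = 1
                except Exception:
                    pass

            #print out the information we need
            print('%s-%s\t%s' % (id, genkey, values))
        except Exception:
            # skip lines that fail to parse
            pass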


and my **reducer.py**


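    import sys
    from ast import literal_eval

    current_keys = None
    current_values = None

    for line in sys.stdin:
        try:
            keys, display = line.rstrip('\n').split('\t', 1)
            values = literal_eval(display)
            if current_keys == keys:
                # same key as the previous line: keep the first values seen
                pass
            else:
                if current_keys is not None:
                    # key changed: emit the finished group as one CSV row
                    print(','.join(str(v) for v in current_values))
                current_keys = keys
                current_values = values
        except Exception:
            pass

    # flush the last group, which the loop never emits
    if current_keys is not None:
        print(','.join(str(v) for v in current_values))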
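
The key-change bookkeeping in the reducer above can also be expressed with `itertools.groupby`, which removes the need for a separate flush after the loop. This is only a sketch under the same assumption the reducer already makes, that lines arrive sorted by key. The names `parse`, `reduce_lines`, and `sample` are made up for illustration, and unlike the original, malformed lines raise an error instead of being skipped.

```python
from ast import literal_eval
from itertools import groupby


def parse(line):
    """Split one mapper line into (key, values)."""
    key, display = line.rstrip('\n').split('\t', 1)
    return key, literal_eval(display)


def reduce_lines(lines):
    """Yield one CSV row per key, built from the first line seen for that key."""
    records = (parse(line) for line in lines)
    for _key, group in groupby(records, key=lambda record: record[0]):
        _, values = next(group)  # remaining lines for this key are ignored
        yield ','.join(str(v) for v in values)


# Hypothetical, already-sorted reducer input.
sample = [
    "3-b\t[0, 0, 1]\n",
    "7-a\t[1, 0, 1]\n",
    "7-a\t[1, 0, 1]\n",
    "9-c\t[1, 1, 0]\n",
]

rows = list(reduce_lines(sample))
for row in rows:
    print(row)
```

Passing `sys.stdin` instead of `sample` would turn this into a streaming reducer that prints values only, with no key in the output.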
