Looking for an elegant solution that avoids merging two dataframes

I have a dask dataframe df that looks as follows:



Main_Author PaperID
A X
B Y
C Z


I also have another dask dataframe pa that looks as follows:



PaperID Co_Author
X D
X E
X F
Y A
Z B
Z D


I want a resulting dataframe that looks as follows:



Main_Author Co_Authors Num_Co_Authors
A (D,E,F) 3
B (A) 1
C (B,D) 2


This is what I did:



df = df.merge(pa, on="PaperID")
df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))


This works on small dataframes. However, since I am working with really large ones, the process keeps getting killed. I believe it is because of the merge. Is there a more elegant way of getting the desired result?
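For reference, the three lines above can be run end-to-end on the sample frames; a minimal self-contained pandas sketch (the sample data from the question is hard-coded here):

```python
import pandas as pd

# sample frames from the question
df = pd.DataFrame({"Main_Author": ["A", "B", "C"],
                   "PaperID": ["X", "Y", "Z"]})
pa = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                   "Co_Author": ["D", "E", "F", "A", "B", "D"]})

# merge on PaperID, collect co-authors per main author, then count them
out = df.merge(pa, on="PaperID")
out = out.groupby("Main_Author")["Co_Author"].apply(tuple).reset_index()
out["Num_Co_Authors"] = out["Co_Author"].apply(len)
print(out)
```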










Tags: python, python-3.x, dask






asked Mar 8 at 17:40 by BKS
1 Answer
If you are looking to work with two large DataFrames, then you could try to wrap the merge in dask.delayed.



• there are terrific examples of dask.delayed in the Dask docs and in answers on SO

• see also the Dask use cases documentation


          Imports



          from faker import Faker
          import pandas as pd
          import dask
          from dask.diagnostics import ProgressBar
          import random
          fake = Faker()


          Generate dummy data in order to get a large number of rows in each DataFrame



          • Specify number of rows of dummy data to generate in each DataFrame

          number_of_rows_in_df = 3000
          number_of_rows_in_pa = 8000


          Generate some big dataset using the faker library (per this SO post)



def create_rows(auth_colname, num=1):
    # build `num` rows, each a dict of {author column: fake name, "PaperID": random id}
    output = [{auth_colname: fake.name(),
               "PaperID": random.randint(1000, 2000)} for x in range(num)]
    return output
          df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
          pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))


          Print first 5 rows of dataframes



          print(df.head())
          Main_Author PaperID
          0 Kyle Morton MD 1522
          1 April Edwards 1992
          2 Rachel Sullivan 1874
          3 Kevin Johnson 1909
          4 Julie Morton 1635

          print(pa.head())
          Co_Author PaperID
          0 Deborah Cuevas 1911
          1 Melissa Fox 1095
          2 Sean Mcguire 1620
          3 Cory Clarke 1424
          4 David White 1569


          Wrap the merge operation in a helper function



def merge_operations(df1, df2):
    df = df1.merge(df2, on="PaperID")
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df
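Before applying the helper, the dask.delayed pattern itself can be illustrated on a trivial made-up function (a minimal sketch; `add` is a hypothetical example, not part of this answer's code):

```python
import dask

@dask.delayed
def add(a, b):
    # the body executes only when compute() is triggered
    return a + b

total = add(1, 2)                # a lazy Delayed object; nothing runs yet
result = dask.compute(total)[0]  # dask.compute returns a tuple of results
print(result)
```

Wrapping `merge_operations` with `dask.delayed(merge_operations)` below follows the same pattern, just without the decorator syntax.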


          Dask approach - Generate final DataFrame using dask.delayed



ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    df_dask = dask.compute(ddf)


          Output of Dask approach



          [ ] | 0% Completed | 0.0s
          [ ] | 0% Completed | 0.1s
          [ ] | 0% Completed | 0.2s
          [ ] | 0% Completed | 0.3s
          [ ] | 0% Completed | 0.4s
          [ ] | 0% Completed | 0.5s
          [########################################] | 100% Completed | 0.6s

          print(df_dask[0].head())
          Main_Author Co_Author Num_Co_Authors
          0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
          1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
          2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
          3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
          4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6


          Pandas approach - Generate final DataFrame created using Pandas



df_pandas = merge_operations(df, pa)

          print(df_pandas.head())
          Main_Author Co_Author Num_Co_Authors
          0 Aaron Anderson (Elizabeth Peterson, Harry Gregory, Catherine ... 15
          1 Aaron Barron (Martha Neal, James Walton, Amanda Wright, Sus... 11
          2 Aaron Bond (Theresa Lawson, John Riley, Daniel Moore, Mrs... 6
          3 Aaron Campbell (Jim Martin, Nicholas Stanley, Douglas Berry, ... 11
          4 Aaron Castillo (Kevin Young, Patricia Gallegos, Tricia May, M... 6


          Compare DataFrames obtained using Pandas and Dask approaches



from pandas.util.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n" + str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'


          Result of comparing two approaches



          print(message)
          DataFrames created using Dask and Pandas are equivalent.
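On the question's stated goal of avoiding the merge altogether: one possible alternative (a sketch, not part of the answer above, and assuming each PaperID maps to a single Main_Author) is to build a PaperID-to-author lookup with Series.map and group pa directly:

```python
import pandas as pd

# sample frames from the question
df = pd.DataFrame({"Main_Author": ["A", "B", "C"],
                   "PaperID": ["X", "Y", "Z"]})
pa = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                   "Co_Author": ["D", "E", "F", "A", "B", "D"]})

# PaperID -> Main_Author lookup replaces the merge
paper_to_author = df.set_index("PaperID")["Main_Author"]

result = (pa.assign(Main_Author=pa["PaperID"].map(paper_to_author))
            .groupby("Main_Author")["Co_Author"]
            .apply(tuple)
            .reset_index())
result["Num_Co_Authors"] = result["Co_Author"].apply(len)
print(result)
```

This sidesteps materializing the merged frame, though the groupby still has to see all of pa.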





answered Mar 8 at 21:26 by edesz























