I have a Spark DataFrame loaded up in memory, and I want to take the mean (or any aggregate operation) over the columns. How would I do that? (In numpy, this is known as taking an operation over axis=1).
If one were calculating the mean of the DataFrame down the rows (axis=0), then this is already built in:
from pyspark.sql import functions as F
F.mean(...)
But is there a way to programmatically do this against the entries in the columns? For example, from the DataFrame below
+--+--+---+---+
|id|US| UK|Can|
+--+--+---+---+
| 1|50| 0| 0|
| 1| 0|100| 0|
| 1| 0| 0|125|
| 2|75| 0| 0|
+--+--+---+---+
Omitting id, the means would be
+------+
| mean|
+------+
| 16.66|
| 3