Chapter 3 Data transformation

In this chapter, we focus on two tasks:

  • Transforming the Timestamp column into the formats we need
  • Splitting or sampling from the original dataset to ensure efficiency

3.1 Transforming Timestamp

We derive four new columns from Timestamp:

Column name   Description
Time          Datetime in %Y-%m-%d %H:%M:%S format, representing China local time
day           A string representing the day of the week
hour          An integer representing the hour of the record
Date          Date in %Y-%m-%d format, representing China local time

We do this in the following steps (a code sketch follows the list):

  • Mutate a new column Time using as_datetime() with tz = "Asia/Shanghai"
  • Define the boundaries of each day using the as.numeric() and as.POSIXct() functions
  • Remove rows whose timestamps fall outside the boundaries we set, i.e., before 2017-11-26 or after 2017-12-02 (55,576 rows in total)
  • Mutate a new column day by comparing Timestamp against the day boundaries with if...else clauses
  • Mutate new columns hour and Date using the hour() and date() functions
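
Below is a minimal sketch of these steps, assuming the raw data has already been read into a data frame named raw with a Unix-timestamp column Timestamp (the object names, the exact day boundaries, and the kept date range are assumptions made for illustration). The report derives day from the day boundaries with if...else clauses; the sketch uses lubridate's wday() as a compact stand-in.

```r
# Minimal sketch, assuming `raw` holds the data and Timestamp is in Unix seconds.
library(dplyr)
library(lubridate)

# Day boundaries as Unix timestamps in China local time
# (assumed kept range: 2017-11-26 00:00:00 up to, but not including, 2017-12-03 00:00:00)
start_ts <- as.numeric(as.POSIXct("2017-11-26 00:00:00", tz = "Asia/Shanghai"))
end_ts   <- as.numeric(as.POSIXct("2017-12-03 00:00:00", tz = "Asia/Shanghai"))

transformed <- raw %>%
  # Convert the Unix timestamp to China local time
  mutate(Time = as_datetime(Timestamp, tz = "Asia/Shanghai")) %>%
  # Drop records that fall outside the boundaries we set
  filter(Timestamp >= start_ts, Timestamp < end_ts) %>%
  mutate(
    # The report builds `day` from the boundaries with if...else clauses;
    # wday() is used here only as a compact illustration
    day  = as.character(wday(Time, label = TRUE, abbr = FALSE)),
    hour = hour(Time),   # integer hour of the record
    Date = date(Time)    # calendar date in China local time
  )
```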

3.2 Splitting datasets

There are three ways to split the dataset to make it easier to process:

  • Splitting by day of the week, so that we can focus on the fluctuation within a single day.
  • Splitting by behavior type. Clicking behaviors account for about 90% of the data, so they can be ignored if we are not studying customers' behavior patterns.
  • Splitting by User_ID. There are about one million users in the dataset, far more than we need to observe trends and make visualizations; we can focus on active buyers or simply draw a random sample of users.

Finally, we choose the last method and randomly select 50,000 users. The sampled data contains 5,068,667 rows, approximately 1/20 of the original dataset, which matches the fraction of sampled users well. We write this data to a new file called sample_data.csv and focus on it in the later chapters.
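
The sketch below illustrates this sampling step, assuming transformed is the data frame produced in Section 3.1 and contains a User_ID column (the object names and the random seed are assumptions made for illustration).

```r
# Minimal sketch, assuming `transformed` comes from the steps in Section 3.1.
library(dplyr)
library(readr)

set.seed(2017)  # arbitrary seed, only to make the sample reproducible

# Randomly pick 50,000 distinct user IDs
sampled_users <- transformed %>%
  distinct(User_ID) %>%
  slice_sample(n = 50000)

# Keep only the rows belonging to the sampled users
sample_data <- transformed %>%
  semi_join(sampled_users, by = "User_ID")

# Write the sample to disk for use in later chapters
write_csv(sample_data, "sample_data.csv")
```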

Please see the corresponding chapter in our GitHub repo for the code in detail.