Chapter 3 Data transformation
In this chapter, we mainly focus on two tasks:
- Transforming the Timestamp column into the formats we need
- Splitting or sampling the original dataset to ensure efficiency
3.1 Transforming Timestamp
We want to derive four new columns from Timestamp:
Column name | Description
---|---
Time | Time in `%Y-%m-%d %H:%M:%S` format, representing China local time
day | A string representing the day of the week
Hour | An integer representing the hour of the record
Date | Date in `%Y-%m-%d` format, representing China local time
We do this in the following steps:
- Mutate a new column `Time` using `as_datetime()` with `timezone = "Asia/Shanghai"`.
- Define the boundaries of each day using the `as.numeric()` and `as.POSIXct()` functions.
- Delete rows that fall outside the boundaries set by `2017-11-26` and `2017-12-02` (55,576 rows in total).
- Mutate a new column `day` from the day boundaries with if…else clauses.
- Mutate new columns `Hour` and `Date` using the `hour()` and `date()` functions.
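The project's actual code is written in R with lubridate (see the GitHub repo); purely as an illustration, the same transformation can be sketched in Python with only the standard library. The function name and return structure here are hypothetical, and China local time is modeled as fixed UTC+8, which is equivalent to Asia/Shanghai for 2017 dates:

```python
from datetime import datetime, timezone, timedelta

# China local time is UTC+8 year-round (no daylight saving),
# so a fixed offset stands in for the Asia/Shanghai timezone.
CST = timezone(timedelta(hours=8))

# Study window: 2017-11-26 through 2017-12-02 (end bound exclusive).
START = datetime(2017, 11, 26, tzinfo=CST)
END = datetime(2017, 12, 3, tzinfo=CST)

def transform(timestamp: int):
    """Derive Time, day, Hour, Date from a Unix timestamp.

    Returns None when the record falls outside the study window,
    mirroring the row-deletion step above.
    """
    t = datetime.fromtimestamp(timestamp, tz=CST)
    if not (START <= t < END):
        return None  # row would be deleted
    return {
        "Time": t.strftime("%Y-%m-%d %H:%M:%S"),
        "day": t.strftime("%A"),  # day of week as a string
        "Hour": t.hour,           # integer hour of the record
        "Date": t.strftime("%Y-%m-%d"),
    }
```

For example, `transform(1511883330)` yields a record dated 2017-11-28 (a Tuesday) in China local time, while a timestamp outside the window returns `None`.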
3.2 Splitting datasets
There are three ways to split our dataset to make it easier to process:
- Splitting by day of the week, so that we focus only on the fluctuation within a single day.
- Splitting by behavior type. Clicking behaviors account for about 90% of the data, so we can ignore them if we are not studying customers' behavior patterns.
- Splitting by User_ID. There are about 1 million users in the dataset, far more than we need to observe trends and make visualizations; we can focus on active buyers, or simply select users at random.
Finally, we choose the last method and randomly select 50,000 users. The selection contains 5,068,667 rows, approximately 1/20 of the original dataset, which matches the 1/20 sampling ratio of users well. We write this data to a new file called sample_data.csv and focus on it in later chapters.
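As a sketch of this sampling strategy, whole users are drawn at random and all of their rows are kept, so each selected user's behavior sequence stays complete. This is illustrative stdlib Python, not the project's R code; the file paths, column name `User_ID`, and function name are hypothetical:

```python
import csv
import random

def sample_by_user(in_path, out_path, n_users=50_000, seed=42):
    """Keep every row belonging to n_users randomly chosen User_IDs.

    Sampling users rather than rows preserves each selected user's
    full behavior sequence. Returns the number of rows written.
    """
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Sort for a deterministic ordering before seeded sampling.
    users = sorted({row["User_ID"] for row in rows})
    chosen = set(random.Random(seed).sample(users, n_users))
    kept = [row for row in rows if row["User_ID"] in chosen]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)
```

Because users are sampled uniformly, the fraction of rows kept tracks the fraction of users kept, which is why 50,000 of roughly 1 million users yields about 1/20 of the rows.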
Please see the corresponding chapter in our GitHub repo for the code in detail.