Chapter 2 Data sources

The Primary data source of this project is Alibaba’s cloud platform Tianchi. The data set contains 1 million randomly selected user behavior including click, purchase, adding item to shopping cart and item favording during November 25 to December 03 2017. Each row of the dataset represents a specific user-item interaction, with attributes user ID, item ID, item’s category ID, behavior type and timestamp in a CSV files. The following table shows the columns of the dataset:

Dataset: UserBehavior.csv:

Column names	Discription
User_ID	An integer, the serialized ID that represents a user
Item_ID	An integer, the serialized ID that represents an item
Category_ID	An integer, the serialized ID that represents the category which the corresponding item belongs to
Behavior_type	A string, enum-type from (‘pv’, ‘buy’, ‘cart’, ‘fav’)
Timestamp	An integer, the timestamp of the behavior

Problems with the dataset:

The dataset is too large for us to deal with since there are over 100 million entries. A ordinary personal computer cannot handle this file. Our solution is to split the data into serveral datesets or just sampling but ensuring representativeness at the same time.
The dataset contains one month’s user behavior on 2017. This might not be a accurate reflection on the most current user behavior, since there might be shopping festival around the time. If customers have just experienced a shopping festival, they are less likely to buy stuff since their need was satisfied not long ago, or there might be a sales promotion during the time. However, this is the most recent data we can collect and it’s really hard to track sales promotion 4 years ago.
Although user ID, item ID and category ID are provided in the dataset, the actual encoding of these ID’s are only avaliable to the company themselves. This bar us from further analyzing the data by categorizing user or items and we believe readers won’t prefer to see a long series of Item ID numbers.

Datasource: We applied for UserBehavior.csv through this link. Since this document is too large(1G after compression), we might only provide the access for sample_data which we will talk about later.