Here I (Mr Chuang) will talk about the process to generate the data.

I started with two Kaggle datasets: one with images but no descriptions, and one with all the details but no images. The sets held independent data, and unfortunately there was no easy way to go from book data -> image or from image data -> description. You'd have to read the file, do a lookup, write the result back in and save it. I'd rather just start from scratch at that point.
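For reference, the read / lookup / write-back route could look roughly like the sketch below. It is only an illustration, not something I ran on the real files: it assumes both files share a key column, which the two Kaggle sets did not make easy, and the column names (`isbn`, `title`, `description`, `image_path`) and the sample rows are made up.

```python
import csv
import io

# Hypothetical sample data standing in for the two Kaggle files.
# The real files have different columns and no guaranteed shared key.
details_csv = "isbn,title,description\n111,Book A,About A\n222,Book B,About B\n"
images_csv = "isbn,image_path\n111,img/111.jpg\n333,img/333.jpg\n"


def merge_on_key(details_file, images_file, key="isbn"):
    """Attach an image path to each details row via a dictionary lookup."""
    # Build the lookup table once: key value -> image path.
    image_by_key = {
        row[key]: row["image_path"] for row in csv.DictReader(images_file)
    }
    merged = []
    for row in csv.DictReader(details_file):
        # Rows without a matching image get an empty path.
        row["image_path"] = image_by_key.get(row[key], "")
        merged.append(row)
    return merged


merged = merge_on_key(io.StringIO(details_csv), io.StringIO(images_csv))

# Write the combined rows back out (here to an in-memory buffer).
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["isbn", "title", "description", "image_path"]
)
writer.writeheader()
writer.writerows(merged)
```

Even in this toy version, one book ends up without an image and one image is never used, which is the kind of mismatch that made starting over more appealing.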

I decided to investigate how the Kaggle datasets were generated, in particular how they used Python to scrape the data from bookdepository.

It was easier to work from one of the scrapers, so I downloaded the source code and added a few extra features. Namely:

After scraping 60 books from each of the 32 categories, we finally have the dataset! The original set had 900 books for each category. With more time, we might have a more complete dataset!
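The per-category cap can be sketched as below. This is not the scraper's actual code: `collect_books` and `fake_fetch` are hypothetical names, and `fake_fetch` just generates placeholder records instead of hitting a website. Only the numbers (32 categories, 60 books each, 900 available per category) come from the text above.

```python
from itertools import islice

BOOKS_PER_CATEGORY = 60  # cap used for this dataset
NUM_CATEGORIES = 32


def collect_books(categories, fetch_category, limit=BOOKS_PER_CATEGORY):
    """Take at most `limit` records from each category's iterator."""
    dataset = {}
    for category in categories:
        # islice stops early, so a lazy fetcher never loads the books
        # beyond the cap.
        dataset[category] = list(islice(fetch_category(category), limit))
    return dataset


def fake_fetch(category):
    """Hypothetical stand-in for a real scraper: yields 900 fake records."""
    for i in range(900):
        yield {"category": category, "title": f"{category} book {i}"}


categories = [f"category-{n}" for n in range(NUM_CATEGORIES)]
dataset = collect_books(categories, fake_fetch)
total = sum(len(books) for books in dataset.values())
print(total)  # 32 * 60 = 1920
```

Raising `limit` towards 900 is how a more complete set would be collected, at the cost of a much longer scraping run.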

Feel free to try scraping your own data and see if you can do it! You might even be able to scrape other websites...