Comparison of Ruby and Python's Pandas for Data Refinement
July 09, 2017 | 11 Minute Read
My everyday tool is Ruby, but a few months ago I started using Python to play with data. I had heard a lot that Python is heavily used by data scientists and scientists in general, but I didn't expect it to give so much power even to a developer without much Python experience. So today I want to briefly introduce you to Python's pandas library.
Here is a quote from the official website:
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
So it promises easy-to-use data structures and data analysis tools; let's check how easy and good it really is.
I will take the eBay Kleinanzeigen dataset of over 370,000 used cars sold on their platform. I'm using this source.
First of all, you should set yourself a goal for what insight you want to get from your data. Here is the question I want to answer: cars of which brand sell fastest in Germany?
I expect you already know how to work with Ruby, so I'll show a step-by-step guide to implementing this in Python, and then show the same operations in Ruby all at once at the end.
First, we need to read the CSV file into a DataFrame.
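A minimal sketch of that step, assuming the file is called autos.csv and is Latin-1 encoded (as the common Kaggle dump of this dataset is):

```python
import pandas as pd

# File name and encoding are assumptions; the Kaggle dump of the
# eBay Kleinanzeigen dataset is typically Latin-1 encoded.
df = pd.read_csv('autos.csv', encoding='latin-1')
```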
Next, let's check what the data looks like, because we need to know what we need and what we can get rid of.
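For example, something like this prints the column names, dtypes, and the first few rows:

```python
# Quick overview: number of rows, column names, dtypes, memory usage...
df.info()

# ...and a peek at the first rows.
print(df.head())
```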
It looks a bit inconvenient, but I can already see a few columns I'm not interested in at all. Let's clean our dataset of them.
Now let’s remove useless columns:
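A sketch of the cleanup; the exact column list is my guess at what is irrelevant for this question, and the names follow the dataset's schema:

```python
# Columns that don't help answer "how fast does a brand sell";
# the exact list here is an assumption.
df = df.drop(['seller', 'offerType', 'abtest', 'nrOfPictures',
              'monthOfRegistration', 'dateCrawled'], axis=1)
```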
Next, let's think about which data is really useful to us and which is not. Here are some ideas (see the filtering sketch after the list):
Very old cars are usually retro and follow different pricing rules than regular ones, so I'll limit my data to a year of registration >= 1996.
People very rarely sell a new car within its first year, so I can limit the year to < 2016.
If a car has fewer than 50 horsepower, it's probably not really a car.
It should be a used car, so the odometer reading should be above 1,000 km.
The car should be in Germany, so I'll limit the postal code range.
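Put together, the filtering could look like this (column names follow the dataset's schema, and the German postal-code bounds here are an assumption):

```python
# Apply the filters described above; the postal-code range for Germany
# (roughly 01067..99998) is an assumption.
df = df[(df['yearOfRegistration'] >= 1996) &
        (df['yearOfRegistration'] < 2016) &
        (df['powerPS'] >= 50) &
        (df['kilometer'] > 1000) &
        (df['postalCode'].between(1067, 99998))]
```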
As a result, we have 239,245 cars. That's still enough for us.
Then, to understand how fast a car was sold, I'll add a new column for it.
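Something along these lines, assuming dateCreated (when the ad was posted) and lastSeen (when the crawler last saw it) approximate the start and end of the sale:

```python
# soldWithin approximates how long the listing stayed on the platform.
df['dateCreated'] = pd.to_datetime(df['dateCreated'])
df['lastSeen'] = pd.to_datetime(df['lastSeen'])
df['soldWithin'] = df['lastSeen'] - df['dateCreated']
```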
I don't believe a car can be sold in less than one hour; I've had some experience with that %)
Also, if a car wasn't sold for more than 6 months, it seems the seller just forgot to mark it as sold.
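So I drop both kinds of outliers, using the 1-hour and 6-month thresholds just described:

```python
# Throw away listings sold suspiciously fast or never really sold.
df = df[(df['soldWithin'] >= pd.Timedelta(hours=1)) &
        (df['soldWithin'] <= pd.Timedelta(days=180))]
```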
OK, the preparation step is finished, so we can start answering the question.
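In pandas this boils down to something like the following one-liner:

```python
# Average time-to-sell per brand, five fastest-selling brands first.
result = df.groupby('brand')['soldWithin'].mean().sort_values().head(5)
print(result)
```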
It looks monstrous, but with a small explanation you can understand everything. Here I'm grouping cars by brand and then calculating the mean of the soldWithin column for each group. After that I sort the result and return the 5 brands with the smallest soldWithin value.
And as a result we have, TADAAAM:
brand         average
opel          8 days 22:28:41.981568
daewoo        8 days 22:37:20.325423
ford          9 days 02:04:24.859733
renault       9 days 05:29:40.067139
volkswagen    9 days 09:15:42.559146
I could have expected this result :)
Again, I want to say that I'm neither a professional data scientist nor a Python developer, so some constructions here may not be optimal and the aggregation function may not be perfectly representative. But it can still give you some insight into how cool pandas is, and if you need to work with CSV-like data in your daily job, you should consider moving that work to Python.
Here is the Ruby version, as promised. IMHO it is much clunkier and much slower to run.
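Below is a rough sketch of the same pipeline with Ruby's standard CSV library; the file name, column names, and thresholds mirror the assumptions made in the Python snippets above, so treat it as an illustration rather than a drop-in script.

```ruby
require 'csv'
require 'time'

# A rough equivalent of the pandas pipeline above; file name, column
# names, and thresholds are the same assumptions as in the Python snippets.
rows = CSV.read('autos.csv', headers: true, encoding: 'ISO-8859-1:UTF-8')

cars = rows.select do |r|
  year   = r['yearOfRegistration'].to_i
  power  = r['powerPS'].to_i
  km     = r['kilometer'].to_i
  postal = r['postalCode'].to_i
  year >= 1996 && year < 2016 && power >= 50 && km > 1000 &&
    postal.between?(1067, 99998)
end

# Time on the platform in seconds, kept within the 1 hour .. 6 months window.
sold_within = cars.map do |r|
  [r['brand'], Time.parse(r['lastSeen']) - Time.parse(r['dateCreated'])]
end.select { |_, seconds| seconds.between?(3600, 180 * 24 * 3600) }

top5 = sold_within
  .group_by { |brand, _| brand }
  .map { |brand, pairs| [brand, pairs.sum { |_, s| s } / pairs.size] }
  .sort_by { |_, avg| avg }
  .first(5)

top5.each { |brand, avg| puts format('%-12s %.2f days', brand, avg / 86_400) }
```

Everything is loaded and filtered in plain Ruby arrays, which is a big part of why it feels so much slower than the vectorized pandas version.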