Saturday, August 30, 2008

The Numerati

Data mining continues to be considered, applied, and adopted in new areas. The Numerati by Stephen Baker has a very interesting chapter called "The Worker."

At IBM's Thomas J. Watson Research Center, a team of data miners, statisticians, and anthropologists is building mathematical models of their colleagues (50,000 of IBM's tech consultants) to improve productivity and automate management. The idea is to compile inventories of all of their skills and then to calculate mathematically (job fit, for example) how best to deploy them.

Quoting the author, "IBM, for example, will also be able to place workers and their skills into the same type of analytic software that they use for financial projections. This way, they will project skills that will be needed (or in surplus) in coming years. This eventually could result in something like futures markets for skills and workers."

The data sources used for the modeling include resumes, project records, online calendars, cell phone and handheld computer usage, call records, emails, etc.

The article also mentions an interesting example of how an IBM manager can select and assign a team of five to set up a call center in Manila.

The criticism, or shall we say skepticism, is directed at the idea that the complexity of highly intelligent knowledge workers can be translated into equations and algorithms. Comments left by readers include concerns about freedom, privacy, harassment by management, discrimination, etc.

But how is this different from the racial profiling techniques used by the United States government after 9/11? Or, insurance agencies charging different premiums to persons based on their demographic profiles?

My guess is that, in the near future, a few companies are going to adopt what IBM is currently doing, in some form or other. According to IBM, the workforce has become too big, and the world too vast and complicated, for managers to get a grip on their workers the old-fashioned way.

And then one day, will your manager come up to you and say that you’ve been assigned a different role because your "job fit" with the work you are currently doing is only 72%? Will you get promoted in your team because you scored 1% higher than your colleague?

Or will an unmentioned and unwritten class system based on an employee’s score define the workplace of tomorrow?

Thursday, August 21, 2008

A Few Questions Before You Churn!

Everyone seems to be modeling customer churn these days. But before you roll up your sleeves and take a dive, here are a few things I learned from David Ogden’s webcasts.

How will you use your churn model?
- Do you want to identify/rank likely churners?
- Do you want to identify/quantify the churn drivers?

Data Collection Window
- How much historical data do you want to use – 3 years or 5 years?

Prediction Window
- Who will churn next month? Who will churn in the next 6 months?

You build a model and predict who will churn next month. But what if the client’s business is such that it usually takes 2-3 months to implement the results from your churn model - set up campaigns, target customers with customized retention offers, send out mailers, etc.? Understand the client’s business and decide on an appropriate prediction window before simply doing what they ask.
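As a minimal sketch of that point, the snippet below uses made-up customer records and dates (all names and numbers are illustrative assumptions, not from the webcasts) to show how a roughly 3-month implementation lag shifts which customers the model should actually be labeling as churners:

```python
from datetime import date

snapshot = date(2008, 8, 1)  # the point in time at which we score customers

# Hypothetical records: (customer_id, churn_date or None if still active)
customers = [
    ("A", date(2008, 9, 15)),   # churns 1 month after the snapshot
    ("B", date(2008, 12, 10)),  # churns 4 months after the snapshot
    ("C", None),                # no churn observed
]

def churn_label(churn_date, lo, hi):
    """1 if the customer churns more than `lo` and at most `hi` months out."""
    if churn_date is None:
        return 0
    months = (churn_date.year - snapshot.year) * 12 + (churn_date.month - snapshot.month)
    return 1 if lo < months <= hi else 0

# Naive target: churn within the next month
naive = [churn_label(d, 0, 1) for _, d in customers]
# Target that respects the ~3-month campaign lag: churn in months 3-6
lagged = [churn_label(d, 3, 6) for _, d in customers]
print(naive, lagged)  # [1, 0, 0] [0, 1, 0]
```

Customer A, the one-month churner, is exactly the customer a naive model flags but the retention campaign can never reach in time.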

Involuntary Churn vs. Voluntary Churn
- Voluntary churn occurs when a customer decides to switch to a competitor or another service provider because of dissatisfaction with the service or the associated fees
- Involuntary churn occurs due to factors like relocation, death, non-payment, etc.

Sometimes models are built leaving out one or the other group of customers. There is a clear difference between the two; decide which one is more important for the client’s business.

Drivers vs. Indicators
- Both influence churn, but drivers are those factors/measures that the company can control or manipulate. Indicators are mostly demographic measures, macro-economic factors, or seasonality, and they are outside the company's control.

Expected time to churn, vs. probability to churn tomorrow
- Survival Time Modeling answers the question, "What is the expected time to churn?" The response variable here is the time (in months, weeks, etc.) until a customer churns.
- Binary Response Modeling answers the question, "Who is likely to churn next week/month/quarter?" The response variable here is the churn indicator (customer stays or leaves).
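The two framings can be built from the same data; the sketch below (toy tenures, purely illustrative) shows the two different response variables, and why the survival framing keeps still-active customers as censored observations instead of forcing them into a 0/1 label:

```python
# Each record: (months observed, whether churn was observed in that time)
records = [(24, True), (6, True), (18, False), (36, True), (12, False)]

# Binary response: a churn indicator for a fixed window, e.g. "churned
# within the first 12 months" -- customers who outlast the window are 0s.
binary_y = [int(churned and tenure <= 12) for tenure, churned in records]

# Survival-time response: time-to-event plus an event flag, so customers
# who are still active stay in the data as censored observations.
survival_y = [(tenure, churned) for tenure, churned in records]

print(binary_y)       # [0, 1, 0, 0, 0]
print(survival_y[2])  # (18, False) -- censored: active for 18 months so far
```

Note that customer 4 (churned at 36 months) is a 0 in the binary framing but still contributes its full time-to-event in the survival framing.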

Monday, August 4, 2008

Log Transformation

One of the most commonly used data transformation methods is taking the natural log of the original values. Log transformation works for data where the errors/residuals get larger for larger values of the variable(s). This trend occurs in most data because the error, or change in the value of a variable, is often a percentage of the value rather than an absolute amount. For the same percent error, a larger value of the variable means a larger absolute error.

For example, a 5% error translates into an error that is 5% of the value of the variable. If the original value is 100, the error is 5% x 100, or 5. If the original value is 500, the error becomes 5% x 500, or 25.


When we take logs, this multiplicative factor becomes an additive factor, because of the nature of logs.

log(X * error) = log(X) + log(error)

The percent error therefore becomes the same additive error, regardless of the original value of the variable. In other words, the non-uniform errors become uniform. And that's why taking logs of the variable(s) helps in meeting the requirements of our statistical analysis most of the time.
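A quick numerical check of this identity, using the same 5% error as above on values of very different sizes:

```python
import math

# A fixed 5% multiplicative error becomes a constant additive error
# after taking logs, no matter the size of the original value.
for x in (100.0, 500.0, 10000.0):
    observed = x * 1.05            # value carrying a 5% error
    abs_err = observed - x         # grows with x: 5.0, 25.0, 500.0
    log_err = math.log(observed) - math.log(x)
    assert math.isclose(log_err, math.log(1.05))  # constant for every x
print(round(math.log(1.05), 4))  # 0.0488
```

The absolute error scales with the value (5, 25, 500), but on the log scale every observation is off by the same log(1.05).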

Reference
A New View of Statistics website