Surveying Mechanical Turk to Validate a Startup Idea

I was intrigued by Lindsey Harper’s post, “How I Used Amazon’s Mechanical Turk to Validate my Startup Idea.” If you’ve ever worked with market research firms, built your own panels, or hit the pavement trying to collect your own market research, you know it can be expensive and/or time consuming. The idea of having a broad and cheap sounding board available online was very appealing.

I figured I would give it a shot, running a few tests through Mechanical Turk to see how it stacked up against some more traditional market research options. I grabbed my latest business idea–viability untested–and set off for Amazon.

Testing Business Viability

Dude! We totally just made 15 cents!

It’s worth noting the startup I was working on was a subscription-based consumer service geared towards parents of younger children and their grandparents. As Lindsey described in her article, you get no segmentation or guaranteed panel refinement on Mechanical Turk, so I was at the mercy of self-selection. I specified in the task description that I was looking for parents of children of a certain age and let it go.

I posed some very basic demographic questions (e.g. gender, age, and the ages of their children). Once I had some basic information on the respondents, I probed whether they faced the problem my service intends to solve. After a description of the service, the survey asked how likely they would be to use it and how likely they would be to recommend it to others. There were also a few service-specific questions, some open-ended prompts including “Why would you not use the service?”, and a general thoughts and feedback form.

I wanted some fancier survey question types than I cared to implement through Amazon’s Mechanical Turk API, so instead I hosted the survey over at SurveyMonkey and had respondents enter a confirmation code into MTurk upon completion.
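
If you want to reproduce that setup today, the sketch below shows the general pattern using boto3’s MTurk client (which postdates this post): a HIT that links out to an externally hosted survey and collects a completion code. The survey URL, reward, title, and field names are placeholders of mine, not values from the original study.

```python
import boto3

# Minimal sketch of the SurveyMonkey + MTurk pattern described above, using
# boto3's modern MTurk client. Everything marked "placeholder" is an
# assumption, not taken from the post.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint for testing; drop endpoint_url to post real HITs.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HTMLQuestion form: link out to the externally hosted survey and collect the
# completion code shown to the respondent at the end of it.
question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html>
    <html><body>
      <form name="mturk_form" method="post"
            action="https://www.mturk.com/mturk/externalSubmit">
        <!-- In a real HIT a small script fills this in from the assignmentId
             query parameter; omitted here for brevity. -->
        <input type="hidden" name="assignmentId" value=""/>
        <p>This task is intended for parents of young children.</p>
        <p>Please complete the survey at:
           <a href="https://www.surveymonkey.com/r/EXAMPLE">placeholder survey link</a></p>
        <p>Completion code: <input type="text" name="completion_code"/></p>
        <input type="submit" value="Submit"/>
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

hit = mturk.create_hit(
    Title="Short survey for parents of young children (placeholder title)",
    Description="Answer a brief survey about a new family-oriented service.",
    Keywords="survey, parents, feedback",
    Reward="0.15",  # echoes the 15-cent caption above, but still illustrative
    MaxAssignments=100,
    LifetimeInSeconds=7 * 24 * 3600,
    AssignmentDurationInSeconds=20 * 60,
    Question=question_xml,
)
print("HIT id:", hit["HIT"]["HITId"])
```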

Results: Well look at that…

Not bad. MTurkers ended up providing answers fairly similar to those I had gathered in the wild. After some light data trimming the data sets lined up well. On the “How likely would you be to use this service?” question, a statistical comparison suggested roughly an 80% chance the response groups were pulled from the same population. The response patterns were slightly shifted, but the overall outcome was the same.
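
The post doesn’t say which test produced that figure. As a rough sketch of how such a comparison might be run on the Likert-style “How likely would you be to use this service?” responses, a Mann-Whitney U test via SciPy is one reasonable choice for ordinal data; the response arrays below are made up for illustration.

```python
from scipy.stats import mannwhitneyu

# Illustrative 1-5 "How likely would you be to use this service?" responses;
# these numbers are made up, not the study's data.
mturk_panel = [4, 3, 5, 2, 4, 4, 3, 5, 1, 4, 3, 4]
other_panel = [4, 4, 5, 3, 4, 3, 3, 5, 2, 4]

stat, p_value = mannwhitneyu(mturk_panel, other_panel, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.2f}")
# A large p-value (~0.8 in the post's case) is consistent with both panels
# drawing from the same underlying population of responses.
```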

The data trimming was done to account for a larger than expected number of MTurk respondents who were very price conscious. Their responses described ongoing harsh economic conditions, the need to save money, and other general hardships. These folks were generally not represented or targeted in my other surveys.

As a bonus, the optional open-ended responses given by MTurk respondents were thoughtful and very useful; I was not expecting this level of detail. The optional question about general thoughts and feedback elicited a 47% response rate with an average of 40 words per response. That mean came with a standard deviation of 32 words, so there were some really substantial responses in there.
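
Numbers like these are easy to compute once the free-text answers are exported from SurveyMonkey or the MTurk results file. A minimal sketch, with placeholder responses:

```python
import statistics

# Placeholder export of the optional free-text answers; "" marks a skipped
# question. Real data would come from the SurveyMonkey/MTurk results file.
responses = [
    "Great idea, my parents live far away and would love this.",
    "",
    "Too expensive for us right now, money is tight.",
    "",
]

answered = [r for r in responses if r.strip()]
word_counts = [len(r.split()) for r in answered]

print(f"response rate: {len(answered) / len(responses):.0%}")
print(f"mean words:    {statistics.mean(word_counts):.1f}")
print(f"stdev words:   {statistics.stdev(word_counts):.1f}")
```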

Semifinal Thoughts

Would I use Amazon’s Mechanical Turk for this purpose again? I think so. It seems to be a good way to get a general feel for your idea and certainly grab some helpful feedback. The responses I received led me to believe it was a very thoughtful community.

In no way am I endorsing a survey of this type as the entirety of your market research. This is a cheap and easy way to get some feelers out there and validate that you’re not (too) crazy. In the end it is still very important to get out there yourself and talk with potential customers early on.

FYI: The result of this work has become GlitterDuck. I am getting ready to start some pilot runs soon. If you are interested in learning more or becoming a beta tester please sign up over at the site.

Python + Django vs. C# + ASP.NET: Productivity Showdown

People often ask me how and why my department shifted from an ASP.NET environment to Django. I’ve finally gotten around to writing about the process leading up to our decision. I hope people out there find it useful in their own development groups and discussions.

Almost two years ago I found myself in a rather unlikely situation: I was running a software engineering department containing both a C# team and a Python team. The Python group was focused on building scientific computing and NLP-type applications, whereas the C# team was focused on building web applications.

A few of us Python folks in the department had already started playing around with Django–building internal web applications and projects outside of work. It did not take long for us to realize the power of Django and how quickly we were able to produce high-quality applications with little effort. This was my (strong) impression, but in order to propose a corporate platform shift I was going to need some data to support my claims.

It slowly dawned on me that I had a perfect test bed. Here we had two teams using different technology stacks within the same department. The same department. That meant they shared the same development processes, project management tools, quality control measures, and defect management processes. Everything was the same between these groups except for the technologies. Perfect! So like any good manager, I turned my teams into unwitting guinea pigs.

The Hypothesis

We can accomplish more with Python + Django than with C# + ASP.NET, given the same amount of time, without sacrificing quality.

Measuring Productivity

For the sake of this study, I defined productivity as a normalized team velocity: how many story points were completed / developer / week. I recorded the normalized team velocity for each team’s sprints for later analysis.
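
Written out as code, the metric is just a small helper; the names below are mine, not from the original write-up.

```python
from dataclasses import dataclass

@dataclass
class Sprint:
    completed_points: float  # story points accepted as done this sprint
    developers: int          # developers who worked the sprint
    weeks: float             # sprint length in weeks

def normalized_velocity(sprint: Sprint) -> float:
    """Story points completed per developer per week."""
    return sprint.completed_points / sprint.developers / sprint.weeks

# e.g. a 2-week sprint in which 4 developers complete 45 points:
print(normalized_velocity(Sprint(45, 4, 2)))  # 5.625 points/developer/week
```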

For those of you unfamiliar with the concept of story points, I highly recommend Mike Cohn’s Agile Estimating and Planning.

WAIT! You can’t compare story points between teams!

I hear this a lot. Yes, you can. The problem is that most people do not bother creating a common scale or continually calibrating their estimates (within or between groups). Generally it’s far more work than most groups need, and it doesn’t deliver them much utility, so it isn’t often discussed or practiced.

The methods described below should outline the additional calibration work that was performed to ensure a common estimation scale between the two teams.

Methods

Both teams continued business as usual, working on projects in parallel, with each sprint staffed by 3-4 developers. It is worth noting that Team ASP.NET did not make use of Microsoft’s MVC framework, but they did use LINQ to SQL for its ORMy powers.

Special care was taken to maintain linkage between the two teams’ effort estimates. During sprint planning, each team used a common story point calibration reference when making estimates. To detect any drift in calibration, I slipped stories into several planning poker sessions that had already been estimated during previous sprints or by the other team; no significant deviations were found.
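
The post doesn’t detail how those deviations were checked. One simple sketch of such a spot check is to compare original and repeated estimates story by story and look at agreement and mean absolute deviation; the values below are made up.

```python
# Made-up example of the re-estimation spot check: the same stories estimated
# in an earlier sprint (or by the other team) versus the repeated estimate.
original_estimates = [3, 5, 8, 2, 5, 13, 3, 8]
repeat_estimates   = [3, 5, 8, 3, 5, 13, 2, 8]

pairs = list(zip(original_estimates, repeat_estimates))
matches = sum(a == b for a, b in pairs)
mean_abs_diff = sum(abs(a - b) for a, b in pairs) / len(pairs)

print(f"exact matches: {matches}/{len(pairs)}")
print(f"mean absolute deviation: {mean_abs_diff:.2f} story points")
```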

At the end of each sprint I calculated the normalized developer velocity (completed story points / developer / week) and recorded the value for both teams. It should be noted that only Django-based sprints were used in the analysis for Team Python.

I recorded results for approximately 6 months.

Results

[Histogram: Normalized Developer Velocities, C# + ASP.NET and Python + Django]

The above histogram shows the distribution of normalized velocities for each completed sprint. The table below summarizes the distribution of velocities for each team.

Summary statistics of each team’s normalized developer velocities

Units: story points / developer / week

         C#/ASP.NET   Python/Django
mean        5.8          11.6
stdev       2.9           2.7
min         0.3           8.5
max         9.3          15.8

The distributions of velocities in the two samples are similarly shaped, but have clearly different means. The average velocity of a C#/ASP.NET developer was found to be 5.8 story points/week, while a Python/Django developer averaged 11.6 story points/week. An independent-samples t-test shows the difference to be statistically significant (t(15) = 4.19, p < 7.8e-4).
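
For reference, a test of this kind can be reproduced with SciPy once you have the per-sprint velocities. The lists below are placeholders roughly consistent with the summary table (the post only publishes summary statistics), so the exact t and p values will differ.

```python
from scipy.stats import ttest_ind

# Placeholder per-sprint velocities, roughly consistent with the summary
# table above; not the actual recorded data.
asp_net_velocities = [0.3, 3.4, 4.8, 5.5, 6.2, 7.6, 9.0, 9.3]
django_velocities  = [8.5, 9.2, 10.1, 10.8, 11.5, 12.0, 12.6, 13.9, 15.8]

# Student's t-test with pooled variance (the default), giving df = 8 + 9 - 2 = 15.
t_stat, p_value = ttest_ind(asp_net_velocities, django_velocities)
print(f"t = {t_stat:.2f}, p = {p_value:.1e}")
```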

Discussions and Conclusion

Given our development processes, we found the average productivity of a single Django developer to be equivalent to the output of two C#/ASP.NET developers. In other words, with equal-sized teams, our Django developers were twice as productive as our ASP.NET developers.

I suspect these results may actually reflect a lower bound on the productivity difference. It should be noted that about half of the Team Python developers, while fluent in Python, had not used Django before. They quickly learned Django, but it is possible this disparity in framework experience introduced an unintended bias, handicapping the overall Django velocity.

Epilogue

The productivity differences quantified by our findings were then included as part of an overall rationale to shift web-based development platforms. Along with overall velocity differences, the costs associated with maintaining each environment were considered: OS licensing and database licensing for development and production environments, as well as costs associated with development tools. I’m happy to say we are now a Python and Django shop.

Update: there are several good questions over at Hacker News.

Agile Scapegoating and People over Process

Last week James Shore posted an article, “The Decline and Fall of Agile,” that generated quite a bit of discussion. He points out that many of the atrocities and failures in software development and software project management done under the guise of “Agile” or “Scrum” are often not true implementations. You have these teams who say they are doing Scrum, but the only things that actually get adopted are sprints and scrums. Many of these failed groups ignore the important and difficult aspects of Scrum like self-organization, shippable product goals, and self-reflection and improvement. As Shore says, they are “having their dessert every night and skipping their vegetables.”

Ken Schwaber replied to Shore’s article:

When Scrum is mis-used, the results are iterative, incremental death marches. This is simply waterfall done more often without any of the risk mitigation techniques or people and engineering practices needed to make Scrum work.

The article also sparked Bob Martin to write an essay “Dirty Rotten ScrumDrels” in response to some of the Scrum scapegoating that has been going on recently. Check out the comments for some Uncle Bob-Shore dialog.

Things like self-reflection and reviews of process and work are practices that people naturally adopt to improve themselves, and I think successful developers tend to do this anyway in order to stay ahead of the curve. When a mediocre team that doesn’t already do this sort of thing goes flailing about for a silver bullet and turns to something like Scrum, it is easy for them to brush those practices off as trivialities; if they haven’t already seen the value, it just gets lost in the noise.

I share Mr. Shore’s frustration at seeing Agile and Scrum blamed for the shortcomings of teams and improper implementations, especially when you’ve seen the real thing work over and over.