<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kurt Grandis &#187; Data</title>
	<atom:link href="http://kurtgrandis.com/blog/category/data/feed/" rel="self" type="application/rss+xml" />
	<link>http://kurtgrandis.com/blog</link>
	<description>Software Engineering &#38; Entrepreneurship</description>
	<lastBuildDate>Sat, 29 May 2010 14:21:55 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Python + Django vs. C# + ASP.NET: Productivity Showdown</title>
		<link>http://kurtgrandis.com/blog/2010/02/24/python-django-vs-c-asp-net-productivity-showdown/</link>
		<comments>http://kurtgrandis.com/blog/2010/02/24/python-django-vs-c-asp-net-productivity-showdown/#comments</comments>
		<pubDate>Wed, 24 Feb 2010 07:33:57 +0000</pubDate>
		<dc:creator>kurt</dc:creator>
				<category><![CDATA[Agile]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Entrepreneurship]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[asp.net]]></category>
		<category><![CDATA[c#]]></category>
		<category><![CDATA[storypoints]]></category>
		<category><![CDATA[velocity]]></category>

		<guid isPermaLink="false">http://kurtgrandis.com/blog/?p=148</guid>
		<description><![CDATA[People are often asking me how and why my department shifted from an ASP.NET environment to Django. I&#8217;ve finally gotten around to writing about the process leading up to our decision. I hope people out there find it useful in their own development groups and discussions.
Almost two years ago I was in a rather unlikely [...]]]></description>
			<content:encoded><![CDATA[<p>People are often asking me how and why my department shifted from an ASP.NET environment to Django. I&#8217;ve finally gotten around to writing about the process leading up to our decision. I hope people out there find it useful in their own development groups and discussions.</p>
<p>Almost two years ago I was in a rather unlikely situation in that I was running a software engineering department containing both a C# team and a Python team. The Python group was focused on building scientific computing and NLP-type applications, whereas the C# team was focused on building web applications.</p>
<p>A few of us Python folks in the department had already started playing around with Django&#8211;building internal web applications and projects outside of work. It did not take long for us to realize the power of Django and how quickly we were able to produce high-quality applications with little effort. This was my (strong) impression, but in order to propose a corporate platform shift I was going to need some data to support my claims.</p>
<p>It slowly dawned on me that I had a perfect test bed. Here we had two teams using different technology stacks within the same department. The same department. That means they shared the same development processes, project management tools, quality control measures, defect management processes. Everything was the same between these groups except for the technologies. Perfect! So like any good manager I turned my teams into unwitting guinea pigs.</p>
<h3>The Hypothesis</h3>
<p style="text-align: center;"><em>We can accomplish more with Python + Django than with C# + ASP.NET given the same amount of time without sacrificing quality</em></p>
<h3>Measuring Productivity</h3>
<p>For the sake of this study, I defined productivity as a normalized team velocity: the number of story points completed per developer per week. I recorded the normalized team velocity for each team&#8217;s sprint for later analysis.</p>
<p>For those of you unfamiliar with the concept of story points, I highly recommend Mike Cohn&#8217;s <a title="Agile Estimating and Planning" href="http://www.amazon.com/Agile-Estimating-Planning-Mike-Cohn/dp/0131479415">Agile Estimating and Planning</a>.</p>
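<p>As a minimal sketch, the normalized velocity defined above could be computed as follows. The function name and the sprint figures (46 points, 4 developers, 2 weeks) are hypothetical and only illustrate the arithmetic; they are not data from the study.</p>

```python
def normalized_velocity(story_points, developers, weeks):
    """Completed story points per developer per week for one sprint."""
    if developers == 0 or weeks == 0:
        raise ValueError("developers and weeks must be non-zero")
    return story_points / (developers * weeks)


# Hypothetical sprint: 4 developers complete 46 story points in 2 weeks.
velocity = normalized_velocity(story_points=46, developers=4, weeks=2)
print(f"{velocity:.2f} story points / developer / week")
```

<p>Dividing by both head count and sprint length is what lets sprints of different sizes and durations sit on the same axis.</p>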
<h3>WAIT! You can&#8217;t compare story points between teams!</h3>
<p>I hear this a lot. Yes, you can. The problem is that most people do not bother creating a common scale or continually calibrating their estimations (within or between groups). Generally, it&#8217;s way more work than most groups need to deal with and it doesn&#8217;t deliver much utility to most groups, so it isn&#8217;t often discussed or practiced.</p>
<p>The methods described below outline the additional calibration work that was performed to ensure a common estimation scale between the two teams.</p>
<h3>Methods</h3>
<p>Both teams continued business as usual, working on projects in parallel. Each sprint was staffed by 3-4 developers. It is worth noting that Team ASP.NET did not make use of the MS MVC Framework, but they did use Linq-to-SQL for its ORMy powers.</p>
<p>Special care was taken to maintain linkage between the two teams&#8217; effort estimates. During sprint planning, each team would use a common story point calibration reference when making estimates. In order to detect any potential deviations in calibration, during several planning poker sessions I included stories that had already been estimated during previous sprints or by the other team; no significant deviations were found.</p>
<p>At the end of each sprint I would calculate the normalized developer velocity ( # of completed story points / developer / week ). These values were recorded for both teams. It should be noted that only Django-based sprints were used in analysis for Team Python.</p>
<p>I recorded results for approximately 6 months.</p>
<h3>Results</h3>
<div id="attachment_261" class="wp-caption alignnone" style="width: 497px"><a rel="attachment wp-att-261" href="http://kurtgrandis.com/blog/2010/02/24/python-django-vs-c-asp-net-productivity-showdown/django_asp_histo-2/"><img class="size-full wp-image-261   " title="Normalized Developer Velocities: C# + ASP.NET and Python + Django" src="http://kurtgrandis.com/blog/wp-content/uploads/2010/02/django_asp_histo1.png" alt="Normalized Sprint Velocities: C# + ASP.NET and Python + Django" width="487" height="367" /></a><p class="wp-caption-text">Normalized Developer Velocities: C# + ASP.NET and Python + Django</p></div>
<p>The above histogram shows the distribution of normalized velocities associated with each completed sprint. The table below summarizes the distribution of velocities associated with each team.</p>
<table style="height: 150px;" border="1" width="470">
<caption>Summary statistics of each team&#8217;s normalized developer velocities</caption>
<tbody>
<tr style="text-align: center;">
<td style="text-align: left;">units:<br />
story points /<br />
developer /<br />
week</td>
<th>C#/ASP.NET</th>
<th>Python/Django</th>
</tr>
<tr style="text-align: center;">
<th style="text-align: left;">mean</th>
<th>5.8</th>
<th>11.6</th>
</tr>
<tr style="text-align: center;">
<td style="text-align: left;">stdev</td>
<td>2.9</td>
<td>2.7</td>
</tr>
<tr style="text-align: center;">
<td style="text-align: left;">min</td>
<td>0.3</td>
<td>8.5</td>
</tr>
<tr style="text-align: center;">
<td style="text-align: left;">max</td>
<td>9.3</td>
<td>15.8</td>
</tr>
</tbody>
</table>
<p>The velocity distributions of the two samples are similarly shaped but differ clearly in their means. <strong>The average velocity of a C#/ASP.NET developer was found to be 5.8 story points/week. A Python/Django developer had an average velocity of 11.6 story points/week. An independent two-sample t-test shows this difference to be statistically significant (t(15) = 4.19, p&lt;7.8e-4).</strong></p>
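<p>For readers who want to sanity-check the test statistic, here is a rough sketch of a pooled-variance (Student) two-sample t statistic computed from the summary table. It assumes the reported test pooled the variances, in which case t(15) implies 17 sprints across both teams. The per-team sprint counts are not given in this post, so the 10/7 split below is purely hypothetical, and with rounded inputs the result should land near, but not exactly on, the reported value.</p>

```python
import math


def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic (equal variances assumed) from summary stats.

    Returns (t, degrees_of_freedom); t is positive when mean_b exceeds mean_a.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / std_err, df


# Means and standard deviations are taken from the summary table.
# The sprint counts are ASSUMED; the post does not report them.
N_ASPNET, N_DJANGO = 10, 7  # hypothetical split of 17 sprints

t, df = pooled_t_statistic(5.8, 2.9, N_ASPNET, 11.6, 2.7, N_DJANGO)
print(f"t({df}) = {t:.2f}")
```

<p>Working from summary statistics is a convenience for checking the order of magnitude; the original analysis would have used the per-sprint velocities directly.</p>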
<h3>Discussion and Conclusion</h3>
<p>Given our development processes, <strong>we found the average productivity of a single Django developer to be equivalent to the output generated by two C# ASP.NET developers. Given equal-sized teams, Django allowed our developers to be twice as productive as our ASP.NET team.</strong></p>
<p>I suspect these results may actually reflect a lower bound of the productivity differences. It should be noted that about half of the Team Python developers, while fluent in Python, had not used Django before. They quickly learned Django, but it is possible this fluency disparity may have caused an unintended bias in results&#8211;handicapping overall Django velocity.</p>
<h3>Epilogue</h3>
<p>The productivity differences quantified by our findings were then included as part of an overall rationale to shift web-based development platforms. Along with overall velocity differences, the costs associated with maintaining each environment were considered: OS licensing and database licensing for development and production environments, as well as costs associated with development tools. I&#8217;m happy to say we are now a Python and Django shop.</p>
<p><strong>Updated:</strong></p>
<p>Several good questions over at <a href="http://news.ycombinator.com/item?id=1148748">Hacker News</a></p>
]]></content:encoded>
			<wfw:commentRss>http://kurtgrandis.com/blog/2010/02/24/python-django-vs-c-asp-net-productivity-showdown/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>GHAPACK: A Library for the Generalized Hebbian Algorithm</title>
		<link>http://kurtgrandis.com/blog/2009/02/08/ghapack-a-library-for-the-generalized-hebbian-algorithm/</link>
		<comments>http://kurtgrandis.com/blog/2009/02/08/ghapack-a-library-for-the-generalized-hebbian-algorithm/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 03:34:04 +0000</pubDate>
		<dc:creator>kurt</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[gha]]></category>
		<category><![CDATA[lsa]]></category>
		<category><![CDATA[lsi]]></category>
		<category><![CDATA[machinelearning]]></category>
		<category><![CDATA[svd]]></category>

		<guid isPermaLink="false">http://kurtgrandis.com/blog/?p=31</guid>
		<description><![CDATA[I recently joined a new open source project called GHAPACK. The project currently provides the functionality and the means to use the Generalized Hebbian Algorithm. I came across this project after banging my head against some of the practical limitations of Singular Value Decomposition (SVD). GHA is a Hebbian-based neural network-like algorithm that approximates SVD&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>I recently joined a new open source project called <a href="http://sourceforge.net/projects/ghapack/">GHAPACK</a>. The project currently provides the functionality and the means to use the Generalized Hebbian Algorithm. I came across this project after banging my head against some of the practical limitations of Singular Value Decomposition (SVD). GHA is a Hebbian-based neural network-like algorithm that approximates SVD&#8217;s ability to perform eigen decomposition. Its added bonus is that it allows for incremental training so you can refine your model with new data without having to recompute using the entire dataset.</p>
<h2>Your Trusty SVD Tool</h2>
<p>SVD is one of those tools that every machine learning practitioner and computational geek will pull out at some time or another. It&#8217;s a powerful matrix factorization technique that allows you to get at the matrix&#8217;s eigenvectors and eigenvalues. One reason it tends to be used so often is that it can be applied to those pesky <em>M x N </em>matrices, which we data junkies tend to generate.</p>
<p>For most small problems I can just use scipy and numpy&#8217;s <em>svd</em> and never give it a second thought. <a href="http://www.netlib.org/lapack/">LAPACK&#8217;s</a> suite of SVD routines power the <em>svd</em> functions of scipy, numpy, and MATLAB among others. These routines are designed for dense matrices and process them in their entirety. What happens when you start dealing with problems in high-dimensional space? Those dense representations and full processing are expensive. So, when your problem space is better suited for sparse matrices you tend to run out of memory or hit non-convergence&#8230;no SVD.</p>
<p>At the time I was considering a problem that would be well-suited for incremental training, meaning I did not want to have to rerun the entire dataset through SVD after adding a small set of new data; GHA lets you avoid that sort of inconvenience and approximates the same outcome (as far as my problem was concerned).</p>
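<p>To make the incremental idea concrete, here is a minimal pure-Python sketch of the textbook GHA update (Sanger&#8217;s rule). It is not GHAPACK&#8217;s code: the function names, the toy 2-D data, and the learning rate are all assumptions chosen for illustration.</p>

```python
import random


def gha_update(weights, x, learning_rate):
    """Apply one Generalized Hebbian Algorithm (Sanger's rule) step in place.

    weights: list of k weight vectors, one row per component estimate.
    x: a single (roughly zero-mean) input vector.
    """
    # All outputs are computed from the weights as they were before this step.
    outputs = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    residual = list(x)
    for w, y in zip(weights, outputs):
        # Deflation: subtract what this and earlier components already explain.
        residual = [r - y * w_j for r, w_j in zip(residual, w)]
        for j in range(len(w)):
            w[j] += learning_rate * y * residual[j]


def make_samples(count, seed=0):
    """Toy 2-D data: large variance along (1, 1), small variance along (1, -1)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        a = rng.gauss(0.0, 2.0)
        b = rng.gauss(0.0, 0.5)
        samples.append([(a + b) / 2 ** 0.5, (a - b) / 2 ** 0.5])
    return samples


weights = [[1.0, 0.0], [0.0, 1.0]]  # two component estimates
initial_batch = make_samples(500)
for epoch in range(20):
    for sample in initial_batch:
        gha_update(weights, sample, learning_rate=0.01)

# Incremental training: fold in new observations without revisiting old ones.
for sample in make_samples(100, seed=1):
    gha_update(weights, sample, learning_rate=0.01)

print(weights)
```

<p>Each row of <em>weights</em> should drift toward one of the leading eigenvectors of the data covariance (up to sign), and the second loop shows new samples being folded in without reprocessing the earlier ones.</p>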
<h2>GHAPACK</h2>
<p>GHAPACK is written by <a href="http://www.dcs.shef.ac.uk/~genevieve/lsa.html">Genevieve Gorrell</a> and based on her work using GHAs to perform Latent Semantic Analysis (LSA).</p>
<h3>&#8220;Offline&#8221; Calculations</h3>
<p>My first order of business upon joining the project was to get the offline training working. This allows you to compute a pseudo-SVD based on a massive matrix without having to load the whole thing into memory.  No more out-of-memory segfaults. Now, you&#8217;re just limited by the resiliency of your hardware. This is now working.</p>
<h3>Memory Management</h3>
<p>I addressed a few memory leaks, but will likely do some restructuring to optimize memory management.</p>
<h3>Resource Library</h3>
<p>I would like the core of the GHA magic to be extracted into a library that others could embed in their own projects. So, I intend to move core functionality into a library and restructure the existing apps into commandline tools that utilize those libs.</p>
<h3>Performance</h3>
<p>Right off the bat, GHA is not known for its speed compared to some other eigen decomposition approaches. Besides that, there is room for some major gains in performance. Let&#8217;s see what we can squeeze out of GHAPACK and perhaps lean on things like BLAS.</p>
<h3>Testing Framework</h3>
<p>A testing framework that objectively keeps track of performance gains while ensuring computational integrity through unit testing always makes refactoring work that much less stressful.</p>
<p>Lots of work to do and a hearty thanks to Professor Gorrell for letting me join her efforts.</p>
<hr />
<h2>Other SVD Resources</h2>
<p>There are other SVD libraries out there that will carry you farther if SVD is what you really want and not necessarily the means to an end.</p>
<p><a href="http://www.netlib.org/scalapack/">ScaLAPACK</a> has parallel SVD code, which creams LAPACK&#8217;s performance when you have access to multiple cores and/or MPI. <a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> and <a href="http://www.netlib.org/svdpack/">SVDPACK</a> both offer Lanczos-based SVD solutions for sparse matrices with ARPACK being well-suited for parallel processing.</p>
]]></content:encoded>
			<wfw:commentRss>http://kurtgrandis.com/blog/2009/02/08/ghapack-a-library-for-the-generalized-hebbian-algorithm/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Edward Tufte: Presenting Data and Information</title>
		<link>http://kurtgrandis.com/blog/2008/03/30/edward-tufte-presenting-data-and-information/</link>
		<comments>http://kurtgrandis.com/blog/2008/03/30/edward-tufte-presenting-data-and-information/#comments</comments>
		<pubDate>Sun, 30 Mar 2008 07:13:14 +0000</pubDate>
		<dc:creator>kurt</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Presentation]]></category>

		<guid isPermaLink="false">http://kurtgrandis.com/blog/2008/03/30/edward-tufte-presenting-data-and-information/</guid>
		<description><![CDATA[I had a chance to attend an Edward Tufte class this past week and it truly was a pleasure. He has published a number of beautiful books on presentation and the visualization of data. So, it was quite a treat to sit in on a presentation by someone that teaches about giving presentations for a [...]]]></description>
			<content:encoded><![CDATA[<p>I had a chance to attend an Edward Tufte class this past week and it truly was a pleasure. He has published a number of beautiful books on presentation and the visualization of data. So, it was quite a treat to sit in on a presentation by someone who teaches about giving presentations for a living. The class was engaging, full of content, and certainly left me with a sense of excitement.</p>
<h3>the class</h3>
<p>One big takeaway for me was that clutter and the sense of being overwhelmed by data are not attributes of too much information, but rather consequences of poor design. How many times have you looked at an information dashboard or a chart in a meeting only to get a headache from trying to grasp what it was trying to communicate? Yet we are capable of navigating and internalizing large amounts of information if it is properly displayed and explained; those are the truly elegant presentational designs.</p>
<p>The class covers the basic principles you would want to follow to present your data in a way that tells a story, and a persuasive one at that. It covers things like how we can lay out and present data to facilitate the basic intellectual process that one goes through when considering and weighing a proposal or story. I went into the class thinking I would learn some better ways to visualize and display complex datasets. I think I have some better ideas in this area, but only as a result of the bigger insight I walked away with on how to make a better presentation.</p>
<p>The other majorly cool bonus was being less than a foot away from a 1st edition Galileo printing. This along with an early printing of Euclid&#8217;s helped demonstrate the power of &#8220;breaking out of flatspace&#8221; by bringing something physical into a meeting.</p>
<h3>bringing it home</h3>
<p>So after all this excitement I went home and tried to look for ways to integrate Tufte&#8217;s design principles into my own presentations and reports. Edward Tufte&#8217;s website has a rich forum called <a href="https://www.edwardtufte.com/bboard/q-and-a?topic_id=1">Ask E.T.</a>, which contains information about presentation in a number of areas including <a href="https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=000076&amp;topic_id=1&amp;topic=Ask+E%2eT%2e">project management</a>. One Ask E.T. thread led me to a project on Google Code that contains a <a href="http://code.google.com/p/tufte-latex/">Tufte-inspired LaTeX layout</a>.</p>
<p>I expect to play around with some of these designs and see how I might better polish my own reports.</p>
]]></content:encoded>
			<wfw:commentRss>http://kurtgrandis.com/blog/2008/03/30/edward-tufte-presenting-data-and-information/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
