last updated before final submission, by Sorin

Commit b6607d23 authored 10 years ago by Lucian Petrică
--- a/doc/papers/date2015/abstract.tex
+++ b/doc/papers/date2015/abstract.tex
 \begin{abstract}
-As FPGAs speed, power efficiency, and logic capacity are increasing, so does the number of applications which make use of FPGA processors.
+As FPGA speed, power efficiency, and logic capacity increase, so does the number of applications which make use of FPGA processors. However, due to placement and routing constraints, balancing FPGA processor instruction delays is a real challenge, especially when the implementation approaches the FPGA resource capacity. 
-However, due to placement and routing constraints, FPGA processors instruction delay balancing is a real challenge especially when the
-implementation approaches the FPGA resource capacity. 
 Consequently, even though some instructions can operate at high frequencies, 
 the slow instructions determine the processor clock period, resulting in the underutilisation of the processor potential.
-However, the fast instructions latent performance may be harnessed through Adaptive Clock Management (ACM), i.e., by dynamically adapting the clock
+However, the latent performance of the fast instructions may be harnessed through Adaptive Clock Management (ACM), i.e., by dynamically adapting the clock frequency such that each instruction gets sufficient time for correct completion. 
-frequency such that each instruction gets sufficient time for correct completion. 
+To date, ACM augmented FPGA processors have been proposed based on Clock Multiplexing (CM), but they suffer from long clock switching delays, which could nullify most of the ACM potential performance gain.
-Up to date, ACM augmented FPGA processors have been proposed based on clock
-multiplexing, but they suffer from long clock switching delays, which nullify most of the ACM potential performance gain.
 This paper proposes an effective FPGA tailored clock manipulation approach able to leverage the ACM potential.
-We first evaluate clock stretching, i.e., the temporary clock period augmentation, as an alternative to clock multiplexing in FPGA processor
+We first evaluate Clock Stretching (CS), i.e., the temporary clock period augmentation, as a CM alternative in FPGA processor designs and introduce an FPGA specific CS circuit implementation.  
-designs and introduce an FPGA specific clock stretching circuit implementation.  
+Subsequently, we evaluate the advantages and drawbacks of the two techniques and propose a Hybrid ACM, which monitors the processor instruction stream and determines the optimal adaptive clocking strategy in order to provide the maximum speedup for the executing program. Given that CS has very low latency at the expense of limited accuracy and dynamic range, we rely on it when the program requires frequent clock period changes. Otherwise, we utilise CM, which is rather slow but enables the FPGA processor operation at the edge of its hardware capabilities. 
-Subsequently, we evaluate the advantages and drawbacks of the two techniques and propose a hybrid ACM, which monitors the processor
+We evaluate our proposal on a vector processor mapped on a Xilinx Zynq FPGA. Our experiments indicate that on 
-instruction stream and determines the optimal adaptive clocking strategy in order to provide the maximum speedup for the executing program.
+Sum of Squared Differences, Neural Network, and FIR filter execution traces, the Hybrid ACM provides up to $14$\% performance increase over the CM based ACM.
-Given that clock stretching has very low latency at the expense of limited accuracy and dynamic range we rely on clock stretching when the
-program requires frequent clock period changes.  
-Otherwise we utilise clock multiplexing, which is rather slow but enables the FPGA processor operation at the edge of its hardware capabilities. 
-We evaluate our proposal on a hybrid ACM augmented FPGA vector processor mapped on a Xilinx Zynq FPGA. Our experiments indicate that on 
-sum of squared differences algorithm, neural network and FIR filter execution traces the hybrid ACM provides up to 14\% performance increase over 
-the clock multiplexing based ACM.
 \end{abstract}
\ No newline at end of file
--- a/doc/papers/date2015/acm_related_work.tex
+++ b/doc/papers/date2015/acm_related_work.tex
@@ -37,12 +37,12 @@ Adaptive clock management has been previously proposed for both ASIC and FPGA ci
 The authors of \cite{petrica_vasile_2013} propose ACM as a work-around for critical paths caused by the long 
 routes to embedded FPGA multipliers, in the context of FPGA vector processing. 
 The proposed solution is based on multiplexing between a slow clock and a fast one. 
-The vector processor communicates whether the instruction in the execution stage is a multiplication, in which case the slow clock is selected, otherwise the fast clock is selected. An up to 28\% performance improvement is reported but the long delays associated with clock sources switching between result in slow-down for some benchmarks. The authors apply instruction reordering compilation techniques in order to reduce the number of clock switches and therefore minimize the clock switching penalties but the effectiveness of this approach is limited by data-dependencies.
+The vector processor communicates whether the instruction in the execution stage is a multiplication, in which case the slow clock is selected, otherwise the fast clock is selected. An up to $28$\% performance improvement is reported, but the long delays associated with switching between clock sources result in slow-downs for some benchmarks. The authors apply instruction reordering compilation techniques in order to reduce the number of clock switches and therefore minimize the clock switching penalties, but the effectiveness of this approach is limited by data dependencies.
-The authors of \cite{chae_dynamic_2010} propose an ASIC tailored Clock Stretching (CS) mechanism capable of extending the clock period by 25\% when slow paths are in use, enabling the circuit operation at increased average clock speed. The proposed adaptive solution requires a specialized Flip-Flop element, to be utilized along critical paths, and a dedicated circuit which generates the stretchable clock signal from four identical-frequency clocks spaced at $90$ degree phase intervals. A 10\% performance increase is claimed at a 10\% critical path activation probability.  An analysis of adder architectures in the context of adaptive CS is performed in \cite{ghosh_arithmetic_2008}, and the authors identify adder architectures which benefit most from the adaptive clocking technique. 
+The authors of \cite{chae_dynamic_2010} propose an ASIC tailored Clock Stretching (CS) mechanism capable of extending the clock period by $25$\% when slow paths are in use, enabling the circuit operation at increased average clock speed. The proposed adaptive solution requires a specialized Flip-Flop element, to be utilized along critical paths, and a dedicated circuit which generates the stretchable clock signal from four identical-frequency clocks spaced at $90$ degree phase intervals. A $10$\% performance increase is claimed at a $10$\% critical path activation probability.  An analysis of adder architectures in the context of adaptive CS is performed in \cite{ghosh_arithmetic_2008}, and the authors identify adder architectures which benefit most from the adaptive clocking technique. 
 Adaptive CS has also been proposed by the same authors as a work-around for slow paths caused by manufacturing process variability \cite{ghosh_defect_2007}. %preserving yield at a claimed 2\% decrease of average performance for a pipelined processor design. Sorin: It is nor really relevant.
 In \cite{singh_fpga_shift} a clock shifting technique is proposed whereby several phase-shifted clocks are distributed 
 and selected at each Flip-Flop in the FPGA fabric, artificially creating clock skew, which extends the effective clock period for a selected path, at the expense of another path which must have its period reduced in order to absorb the clock skew difference.
-The authors evaluate their proposal on several benchmark circuits and report close to 25\% maximum speed-up. 
+The authors evaluate their proposal on several benchmark circuits and report close to $25$\% maximum speed-up. 
 However the technique has limited applicability as in most practical cases there is not enough available slack to absorb the extra time awarded to the slow path.
--- a/doc/papers/date2015/conclusion.tex
+++ b/doc/papers/date2015/conclusion.tex
-\section{Conclusion and Future Work}\label{conclusion}
+\section{Conclusions}\label{conclusion}
-In this paper we have proposed a hybrid technique for Adaptive Clock Management in FPGA processors, 
+In this paper we proposed a hybrid technique for FPGA processor Adaptive Clock Management (ACM), meant to harness the latent processor performance in situations where FPGA congestion or lack of embedded resources causes unbalanced paths that reduce the overall operating frequency. We built upon previous work on Clock Stretching (CS) for ASIC circuits and Clock Multiplexing (CM) for FPGA processors. 
-designed to harness the latent processor performance in those situations where FPGA congestion or lack of embedded resources 
+We presented an efficient FPGA tailored CS implementation utilizing clock multiplexer components 
-causes unbalanced paths to reduce the overall operating frequency. 
+already available in the Xilinx 7-Series FPGA architecture. We evaluated the CS performance and demonstrated that when compared with CM it exhibits lower latency. Our analysis also identified two CS drawbacks, namely low accuracy and reduced dynamic range, which makes CS ACM suboptimal for certain applications. We therefore proposed and evaluated a Hybrid ACM relying on a combination of CS and CM methods. Our Hybrid ACM monitors the processor instruction stream and decides which technique is the most effective given the characteristics of the executing program. We evaluated the Hybrid ACM on 
-Our proposed method builds upon previous work on Clock Stretching in ASIC circuits and Clock Multiplexing (CM) for FPGA processors. 
+traces of the Sum of Squared Differences, Neural Network, and FIR filter algorithms executed on a vector processor mapped on a Xilinx Zynq FPGA and demonstrated a performance increase of up to $14$\% when compared to CM ACM. The Hybrid ACM technique does not require any compile-time optimizations, consumes only $52$ LUTs, one FPGA clock generation block (MMCM), and $6$ FPGA clock multiplexers, and dissipates an additional 100 mW of power, mainly due to the MMCM. Note that if an MMCM is already utilized 
-We presented an efficient implementation of Clock Stretching (CS) based Adaptive Clock Management (ACM), utilizing clock multiplexer components 
+for FPGA frequency synthesis it can be also utilized by the Hybrid ACM, thus avoiding any power penalty.
-already available in the Xilinx 7-Series FPGA architecture. 
\ No newline at end of file
-We evaluated the performance of CS ACM and demonstrated that in comparison to Clock Multiplexing, the Clock Stretching technique has lower 
-latency.
-Our analysis also identifies two drawbacks of CS ACM, namely low accuracy and reduced dynamic range, which makes CS ACM suboptimal 
-in certain application scenarios.
-We therefore propose and evaluate a hybrid ACM, combining the CS and CM methods. Our hybrid ACM monitors the processor instruction stream 
-and decides which technique is most effective given the characteristics of the executing program. We evaluate the hybrid ACM on 
-traces of the sum of squared differences, neural network and FIR filter algorithms executing on a vector processor mapped to a 
-Xilinx Zynq FPGA and demonstrate a 
-performance increase of up to 14\% compared to CM ACM. Notably, our hybrid ACM technique does not require 
-any compile-time optimizations, and consumes only 52 LUTs, one MMCM and 6 FPGA clock multiplexers.
-The hybrid ACM dissipates an additional 100 mW of power, which is mainly due to the FPGA clock generation block (MMCM). 
-In applications where a MMCM is already utilized 
-for frequency synthesis in the FPGA, the same MMCM may be utilized by the hybrid ACM, thus avoiding the power penalty.
\ No newline at end of file
--- a/doc/papers/date2015/cs_cm_theoretical.tex
+++ b/doc/papers/date2015/cs_cm_theoretical.tex
@@ -18,7 +18,7 @@ The CM execution time $T_{CM}$ is defined in Equation \eqref{clock-mux-time} as
 \begin{align}
 &T_{CM}=N_{FI}T_F+N_{SI}T_S+T_{SW}^{avg}N_{SW}\label{clock-mux-time}
 \end{align}
-In the case of the Clock Stretching (CS) based ACM in \cite{chae_dynamic_2010}, a reference clock period $T_R$ may be on demand extended  by exactly 25\% . 
+In the case of the Clock Stretching (CS) based ACM in \cite{chae_dynamic_2010}, a reference clock period $T_R$ may be on demand extended  by exactly 25\%. 
 In our particular execution performance model, we distinguish two CS use cases. 
 If $T_S$ is less than $1.25*T_F$, then CS may be utilized with $T_F$ as the reference clock period.
 Otherwise, the reference clock period must be longer than $T_F$ such that $T_S$ is equal to $1.25*T_R$.
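The CM execution time model and the two CS use cases above can be illustrated with a minimal Python sketch. The function names `cm_time` and `cs_reference_period` are hypothetical and do not appear in the paper sources. The sketch assumes the fixed 25\% stretch of \cite{chae_dynamic_2010} and takes all periods in nanoseconds.

```python
def cm_time(n_fi, n_si, n_sw, t_f, t_s, t_sw_avg):
    """Clock multiplexing execution time, following Equation (clock-mux-time):
    T_CM = N_FI*T_F + N_SI*T_S + T_SW^avg*N_SW."""
    return n_fi * t_f + n_si * t_s + t_sw_avg * n_sw


def cs_reference_period(t_f, t_s, stretch=0.25):
    """Reference period T_R for a fixed-ratio clock stretcher.

    Assumption: the stretch ratio is fixed (25% by default, as in the cited
    ASIC work). If T_S fits within the stretched fast period, T_F is used as
    the reference; otherwise T_R is lengthened so that (1 + stretch) * T_R
    equals T_S.
    """
    if t_s <= (1.0 + stretch) * t_f:
        return t_f
    return t_s / (1.0 + stretch)
```

For instance, with $T_F = 6.25$ ns and $T_S = 8.5$ ns, $1.25*T_F = 7.8125$ ns is shorter than $T_S$, so the second use case applies and the sketch yields $T_R = 6.8$ ns.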

--- a/doc/papers/date2015/evaluation.tex
+++ b/doc/papers/date2015/evaluation.tex
 \section{Evaluation}\label{evaluation}
-In this section we evaluate the proposed adaptive clocking methodology with regard to performance, resource utilization and power dissipation.
+In this section we evaluate the proposed adaptive clocking methodology with regard to performance, resource utilization, and power dissipation.
 \subsection{Average Clock Switching Time}
-The BUFGCTRL documentation only gives an upper bound on the clock switch time. In order to accurately predict the performance of the Clock Switching 
+Given that the BUFGCTRL documentation only gives an upper bound on the clock switch time, in order to accurately predict the performance of the Clock Stretching 
-and Clock Multiplexing strategies in hybrid mode, we evaluate the actual clock switch times in simulation. 
+and Clock Multiplexing strategies in a hybrid framework, we evaluate the actual clock switch times by means of simulations. 
-The experimental methodology is as follows. 
+The experimental methodology is as follows: 
-Two clock signals of period $T_F$ and $T_S$ are generated and connected to the inputs of a BUFGCTRL. 
+Two clock signals of periods $T_F$ and $T_S$, with $T_S$ initially 1\% larger than $T_F$, are generated and connected to the inputs of a BUFGCTRL clock multiplexing buffer. The BUFGCTRL is switched between the two clocks $1000$ times, each time waiting for a random number of clock cycles, between $1$ and $10$, before performing the next switch.
-The initial value of $T_S$ is 1\% larger than $T_F$.
-The BUFGCTRL switches between the two clocks $1000$ times, each time waiting for a random number of clocks, between 1 and 10, before performing the next switch.
 The duration of the entire simulation is measured and the clock switching overhead is computed. 
 Subsequently, $T_S$ is increased by another 1\% of $T_F$ and the evaluation is repeated. 
-The results of the evaluation are presented in Figure \ref{mux-switch-time}. 
+The evaluation results are presented in Figure \ref{mux-switch-time}, where one can observe that the average clock switch time increases with $T_S$, as expected. 
-The observed average clock switch time increases with $T_S$ as expected. 
 \begin{figure}[!t]
 \centering
@@ -24,30 +21,16 @@ The observed average clock switch time increases with $T_S$ as expected.
 \subsection{Execution Performance}
-In order to establish the relative performance increase compared to the previous work on clock multiplexing, 
+In order to establish the relative performance increase when compared with previous work on clock multiplexing, we evaluate the proposed methodology against the timing parameters of the Vector Processor (VP) in \cite{petrica_vasile_2013} utilizing instruction traces of the Sum of Squared Differences (SSD) algorithm in several variants, as well as a Neural Network (NN) algorithm, and a FIR filter. 
-we evaluate the proposed methodology against the timing parameters of the vector processor in \cite{petrica_vasile_2013} 
+As part of the NN algorithm, the VP performs the dot product between the vector of perceptron inputs and the vector of weights. Both SSD and NN are heavily utilized in computer vision \cite{chen2010handbook}, while FIR is essential to many signal processing applications, therefore we can consider this algorithm mix representative of an expected real-world vector processor workload.
-utilizing instruction traces of the sum of squared differences (SSD) algorithm in several variants, as well as a neural network algorithm and a FIR filter. 
-In the neural network (NN) algorithm the vector processor performs the dot product between the vector of perceptron inputs and the vector of weights. 
-Both SSD and NN are heavily utilized in computer vision \cite{chen2010handbook}, while FIR is essential to many signal processing applications, 
-therefore we consider this algorithm mix representative of an expected real-world vector processor workload.
-The target vector processor executes all instructions in a single clock cycle. 
-Multiplication is performed through two instructions, one of which initiates the multiplication, while the other copies either the lower or the 
-upper part of the result into the destination register. 
-Both of these instructions have a latency of $8.5$ ns ($T_S$), while all other instructions have a latency of $6.25$ ns ($T_F$). 
-From these parameters we determine that in Clock Stretching mode the best approximation for $T_S$ is achieved for a 50\% stretch of $T_F$, resulting in 
-$T_S^CS$ of $9.375$ ns. 
-From Figure \ref{mux-switch-time} we are also able to identify the value of $T_{SW}^avg$ as approximately $3.5$ ns. 
-We utilize these values to configure our hybrid ACM.
-We obtained SSD execution traces for the selected algorithms from the authors of \cite{petrica_vasile_2013}, which we 
+The targeted VP executes all instructions in a single clock cycle. Multiplication is performed through two instructions: one initiates the multiplication and the other copies either the lower or the upper part of the result into the destination register. When the VP is mapped on a Zynq FPGA, both multiplication related instructions have a latency of $8.5$ ns ($T_S$), while all the other VP instructions have a latency of $6.25$ ns ($T_F$). 
-utilized to evaluate the execution time for the CS and CM strategies in ISim, a Xilinx FPGA simulation environment. 
+From these parameters we determine that in Clock Stretching mode the best approximation for $T_S$ is achieved for a 50\% stretch of $T_F$, resulting in $T_S^{CS}$ of $9.375$ ns. 
-We requested and obtained multiple variants of SSD execution traces, with and without loop tiling\cite{xue2000loop}, and tile sizes up to 30 in increments of 5.
+From Figure \ref{mux-switch-time} we are also able to identify the value of $T_{SW}^{avg}$ as approximately $3.5$ ns and we utilize these values to configure the proposed hybrid ACM.
-Increasing the tile size reduces the number of clock switches but maintains the total number of instructions and the instruction mix, therefore 
-evaluating several tiled versions of SSD isolates the effect of $N_{SW}$ on the performance of the ACM system.
+We obtained SSD execution traces for the selected algorithms from the authors of \cite{petrica_vasile_2013} and utilized them to evaluate the execution time for the CS and CM strategies in ISim, a Xilinx FPGA simulation environment. We exercised multiple variants of SSD execution traces, with and without loop tiling~\cite{xue2000loop}, with tile sizes up to $30$ in increments of $5$.
-Figure \ref{ssd-results} presents the results of the SSD evaluation for the Clock Multiplexing and Clock Stretching strategies for tile sizes $0$ to $15$. 
+Increasing the tile size reduces the number of clock switches but maintains the total number of instructions and the instruction mix, therefore evaluating several tiled versions of SSD isolates the effect of $N_{SW}$ on the ACM system performance. Figure \ref{ssd-results} presents the results of the SSD evaluation for Clock Multiplexing (CM) and Clock Stretching (CS) strategies for tile sizes $0$ to $15$. 
-The CM execution time decreases with the tile size increase, converging toward the theoretical optimum derived from the circuit timing parameters. 
+The CM execution time decreases as the tile size increases, converging toward the theoretical optimum derived from the circuit timing parameters. The CS execution time remains constant, as expected, and is less than the CM execution time up to a tile size of $5$. We also observe in the figure that the Hybrid ACM correctly detects the algorithm characteristics and selects the most favorable technique for each SSD tiling variant.
-The CS execution time remains constant as expected, and is less than the CM execution time up to a tile size of $5$.
-The hybrid ACM is able to correctly detect the characteristics of the algorithm and select the most favorable technique for each tiling variant of SSD.
 \begin{figure}[!t]
 \centering
@@ -56,14 +39,12 @@ The hybrid ACM is able to correctly detect the characteristics of the algorithm
 \label{ssd-results}
 \end{figure}
-The measured SSD results correspond perfectly to the predicted performance for the CM and CS strategies, given the characteristics of the vector 
+The measured SSD results correspond perfectly to the performance predicted for the CM and CS strategies, given the VP characteristics and the SSD algorithm.
-processor and the SSD algorithm.
-Table \ref{tab-cs-vs-cm} summarizes the performance results for all algorithms, listing the best speedup achieved by the hybrid strategy when compared to CM. 
+Table \ref{tab-cs-vs-cm} summarizes the performance results for all algorithms, listing the best speedup achieved by the Hybrid ACM strategy when compared with the CM ACM. 
-For SSD, the best speedup is achieved on the untiled SSD benchmark, where CS improves performance by approximately 11\%. 
+For SSD, the best speedup is achieved on the untiled SSD benchmark, where the CS utilization improves performance by approximately $11$\%. 
-The neural network dot-product based algorithm benefits 14\% from the hybrid approach, while FIR sees only a small 5\% decrease of execution time 
+The NN dot-product based algorithm benefits $14$\% from the hybrid approach, while FIR experiences only a small $5$\% decrease of execution time compared to Clock Multiplexing. 
-compared to Clock Multiplexing. 
+The hybrid ACM never performs worse than the CM ACM because it is able to detect those cases where multiplexing is the best strategy, such as on the tiled SSD benchmark, where there is no speedup.
-The hybrid ACM never performs worse than the clock multiplexing ACM because it is able to detect those cases where multiplexing is the best strategy, 
-such as on the tiled SSD benchmark, where there is no speedup.
 \begin{table}[!t]
 \renewcommand{\arraystretch}{1.3}
@@ -74,7 +55,7 @@ such as on the tiled SSD benchmark, where there is no speedup.
 \hline
 Algorithm & SSD Untiled & SSD Tile 30 & NN & FIR\\
 \hline
-$T_CM$ [ms] & 1.28 & 1.10 & 3.65 & 12.75\\
+$T_{CM}$ [ms] & 1.28 & 1.10 & 3.65 & 12.75\\
 \hline
 $T_{hybrid}$ [ms] & 1.13 & 1.10 & 3.13 & 12.07\\
 \hline
@@ -83,18 +64,14 @@ Speedup & 1.11 & 1 & 1.14 & 1.05\\
 \end{tabular}
 \end{table}
-\subsection{Resource utilization, Maximum Frequency, and Power Dissipation}
+\subsection{Resource Utilization, Maximum Frequency, and Power Dissipation}
 The ACM resource utilization is presented in Table \ref{acm-resources}. 
-The CM ACM requires a single BUFGCTRL, while the CS-ACM and HACM require 4 and 6 BUFGCTRLs respectively. 
+The CM ACM requires a single BUFGCTRL, while the CS ACM and the Hybrid ACM require $4$ and $6$ BUFGCTRLs, respectively. Typical Xilinx 7-Series FPGAs have $32$ or more BUFGCTRLs, therefore we can consider this utilization acceptable. 
-Typical Xilinx 7-Series FPGAs have 32 or more BUFGCTRLs, therefore we consider this utilization acceptable. 
+The Hybrid ACM requires the most logic resources. Of the total Hybrid ACM resources, the decision block takes up $45$ LUTs, $33$ flip-flops, and $3$ DSPs, and can operate at a maximum frequency of $350$ MHz. 
-The hybrid ACM requires the most logic resources. Of the total hybrid ACM resources, the decision block takes up 45 LUTs, 33 flip-flops, and 3 DSPs, 
+If higher frequencies are required, system designers may opt to utilize a CM based ACM. 
-and has a maximum frequency of 350 MHz. 
+The top operating frequency of both the CS ACM and the Hybrid ACM is fundamentally limited by the routing between the control logic and the BUFGCTRLs. The power dissipation was estimated by Xilinx Power Analyzer from the ACM synthesized netlists along with the simulation activity files.
-If higher frequencies are required, system designers may opt to utilize a CM based ACM. 
+Both the CS ACM and the Hybrid ACM dissipate 100 mW of power, mostly due to the MMCM, while the CM ACM, which requires a single BUFGCTRL that has to be utilized anyway for clock buffering, can be considered to induce zero power overhead.
-Both the CS based ACM and the hybrid ACM are fundamentally limited in top frequency by the routing between the control logic and the BUFGCTRLs.
-The power dissipation was estimated by Xilinx Power Analyzer from the synthesized netlist of the ACM, along with the simulation activity files.
-The entire CS based ACM and hybrid ACM dissipate 100mW of power, mostly due to the MMCM. The CM based ACM only required no MMCM and a single BUFGCTRL, 
-and considering that the BUFGCTRL would have been utilized anyway for clock buffering, we consider the CM based ACM to have zero power overhead.
 \begin{table}[!t]
 \renewcommand{\arraystretch}{1.3}

--- a/doc/papers/date2015/hybrid_acm.tex
+++ b/doc/papers/date2015/hybrid_acm.tex
 \section{FPGA Tailored Hybrid ACM}\label{stretcher}
-Our aim is to provide an Adaptive Clock Manager (ACM) implementable on modern FPGAs and able to operate in conjunction with any FPGA programmable processor, which performance should not rely on code recompilation. 
+Our aim is to provide an Adaptive Clock Manager (ACM) implementable on modern FPGAs and able to operate in conjunction with any FPGA programmable processor, whose performance should not rely on code recompilation. This constraint derives from the desire not to increase system complexity, and 
-This constraint derives from the desire not to increase system complexity, and 
+also from the fact that data dependencies prevent optimization on some applications, as demonstrated in~\cite{petrica_vasile_2013}.  Furthermore, recompilation cannot be performed on pre-compiled binaries, which is a common form of deliverable for proprietary software.
-also from the fact that data dependencies prevent optimization on some applications, as previous work has demonstrated. 
-Furthermore, recompilation cannot be performed on pre-compiled binaries, which is a common form of deliverable for proprietary software.
 \subsection{FPGA Clock Stretching}
 We implement a Clock Stretching ACM by multiplexing between $N_{PC}$ out of phase clock signals, 
-$PC_1$ to $PC_N$. 
+$PC_1$ to $PC_{N_{PC}}$. 
-We call these the Phased Clocks (PCs), and note that as much as seven PCs may be generated from a 
+We call these Phased Clocks (PCs), and note that as many as $7$ PCs may be generated from a 
-mixed mode clock manager (MMCM) or phase locked loop (PLL) in the 7-Series FPGA architecture. 
+Mixed Mode Clock Manager (MMCM) or phase locked loop (PLL) in the 7-Series FPGA architecture. 
-As opposed to clock multiplexing between asynchronous clocks, BUFGCTRL-based multiplexing between PCs results in a predictable switch time if the source PC 
+As opposed to clock multiplexing between asynchronous clocks, BUFGCTRL-based multiplexing between PCs results in a predictable switch time if the source PC is ahead of the destination PC, i.e., the falling edge of the destination PC arrives after the falling edge of the source. 
-is ahead of the destination PC, i.e., the falling edge of the destination PC arrives after the falling edge of the source. 
 In this implementation, $T_{SW}$ is always equal to the phase delay between the source and destination PCs, 
 resulting in the controlled stretch of exactly one output period every time the multiplexer selection input changes. 
-Figure \ref{stretcher4} exemplifies a CS ACM, consisting of a MMCM generating the required PCs and a 4-input clock multiplexer with control logic.
+Figure \ref{stretcher4} depicts a CS ACM, consisting of an MMCM generating the required PCs and a $4$-input clock multiplexer with control logic. The DRP and PS ports are utilized to configure the MMCM with the required PC frequencies and phase relationships.
-The DRP and PS ports are utilized to configure the MMCM with the required PC frequencies and phase relationships.
+Equation \eqref{stretch-by-sm} captures the relationship between the output clock period and the selection input, where $D_{PC}(x,y)$ is a function which returns the phase delay of $PC_y$ relative to $PC_x$, and $S_M$ is the value of the multiplexer selection input. 
-Equation \eqref{stretch-by-sm} illustrates the relationship between the output clock period and the selection input, 
-where $D_{PC}(x,y)$ is a function which returns the phase delay of $PC_y$ relative to $PC_x$, and $S_M$ is the value of the multiplexer selection input. 
 \begin{subequations}
 \begin{align}
 S_I,S_M &\in [0,N_{PC})\\
@@ -40,7 +34,7 @@ If the phase difference between PCs is constant and equal to $T_{PC}/N_{PC}$, th
 its output a continuous train of stretched clock periods by incrementing the multiplexer selection in each output period. 
 This is achieved by adding an accumulator to the control path, such that the user controls the rate of selection change, and therefore 
 the period of the output clock, through the $S_I$ input.
-In this mode, the ACM performs the function of a $N_{PC}$-to-1 clock multiplexer without the drawback of the switch time.
+In this mode, the ACM performs the function of an $N_{PC}$-to-1 clock multiplexer without the drawback of the switch time.
 Equation \eqref{stretch-by-si} expresses the relationship between the period of the output clock and the value of the $S_I$ input of the ACM.
 %\begin{figure}[!t]
@@ -50,15 +44,12 @@ Equation \eqref{stretch-by-si} expresses the relationship between the period of
 %\label{histogram}
 %\end{figure}
-Our proposed Clock Stretching implementation has increased dynamic range compared to previous work, as the output period can be as much as $(2-1/N_{PC})*T_{PC}$. 
+The proposed Clock Stretching implementation has increased dynamic range when compared with previous work, as the output period can be as much as $(2-1/N_{PC})*T_{PC}$. 
 Additionally, multiple stretch levels can be achieved, in steps of $(1/N_{PC})*T_{PC}$.
-These properties combined increase the approximation accuracy of the CS.
+These combined properties increase the CS approximation accuracy.
 Theoretically, the clock stretching ACM may be extended to any number of internal PCs, thereby increasing the output frequency precision arbitrarily. 
-However, beyond 4 PCs, restrictions in the routing between BUFGCTRLs on the FPGA fabric do not permit the construction of a balanced multiplexer tree. 
+However, beyond $4$ PCs, restrictions in the routing between the FPGA fabric BUFGCTRLs do not permit the construction of a balanced multiplexer tree. In an unbalanced multiplexer, some PC inputs pass through more buffers and are delayed in relation to the others, breaking the required phase relationships between PCs. 
-In the unbalanced multiplexer, some PC inputs pass through more buffers and are delayed in relation to the others, breaking the required phase 
+This delay may be compensated for by adjusting the PC phase, but the small expected increase in output frequency precision does not justify the increase in system complexity, so we decided not to follow this avenue any further.  
-relationships between PCs. 
-This delay may be compensated for by adjusting the phase of the PCs, but the small expected increase in output frequency precision does not 
-justify the increase in system complexity, therefore we do not follow this avenue further.  
 %\begin{figure}[!t]
 %\centering
@@ -72,9 +63,8 @@ justify the increase in system complexity, therefore we do not follow this avenu
 For applications which extract more performance from Multiplexing than Stretching, the CS ACM is extended with a hybrid mode, whereby 
 the MMCM is configured to provide an additional clock output $CO_{TS}$ of arbitrary period which is connected to an additional input of the clock multiplexer. 
 In this configuration, Multiplexing or Stretching may be utilized, according to a decision by the ACM control logic. 
-In order to make a decision, the control logic receives as inputs the timing parameters $T_S$, $T_F$, $T_F^{CS}$, $T_S^{CS}$, and $T_{SW}$, 
+In order to make a decision, the control logic takes as inputs the design-time timing parameters $T_S$, $T_F$, $T_F^{CS}$, $T_S^{CS}$, and $T_{SW}$, 
-and monitors a history of 100 instructions in order to estimate the $N_{SW}$, $N_{SI}$, and $N_{FI}$ parameters of the executing program. 
+and monitors a history of $100$ instructions in order to estimate the $N_{SW}$, $N_{SI}$, and $N_{FI}$ parameters of the executing program. These parameters are used to estimate the execution times of both ACM strategies by means of Equations \eqref{clock-mux-time} and \eqref{clock-stretch-time}, in order to choose the most appropriate clock manipulation strategy.
-The parameters are utilized to estimate execution times for both ACM strategies utilizing Equations \eqref{clock-mux-time} and \eqref{clock-stretch-time}.
 \begin{figure}[!t]
 \centering
@@ -83,8 +73,6 @@ The parameters are utilized to estimate execution times for both ACM strategies
 \label{hybrid-acm}
 \end{figure}
-Figure \ref{hybrid-acm} presents the implementation of the hybrid ACM in a configuration with four PCs. 
+Figure \ref{hybrid-acm} presents the implementation of a hybrid ACM in a $4$-PC configuration. 
-A six-input multiplexer has been utilized, in order to provide four balanced clock routes for the PCs, on inputs 1, 2, 5 and 6. 
+A $6$-input multiplexer has been utilized, in order to provide four balanced clock routes for the PCs, on inputs $1$, $2$, $5$, and $6$. 
-The clock $CO_{TS}$ may be connected to either of the middle inputs. 
+The clock $CO_{TS}$ may be connected to either of the middle inputs, and the remaining multiplexer input is left unconnected. No delay compensation is required for $CO_{TS}$ because it has no phase relationship requirement with respect to the other multiplexer inputs. 
-The remaining multiplexer input is unconnected.
-No delay compensation is required for $CO_{TS}$ because it has no phase relationship requirement to other multiplexer inputs. 
--- a/doc/papers/date2015/introduction.tex
+++ b/doc/papers/date2015/introduction.tex
@@ -6,6 +6,7 @@ In the coprocessor role, the FPGA may implement a fixed-function or a programmab
 soft processor based solutions are becoming more common, as evidenced by the multitude of soft processor designs which have been proposed in the scientific literature \cite{lysecky2005study,cheah2012lean} as well as by FPGA vendors \cite{microblaze,nios}. 
 Although FPGAs have some advantages over Application-Specific Integrated Circuits (ASICs) with regard to flexibility, lower development costs and shorter time to market, the FPGA processor performance is not only lower than that of an equivalent ASIC circuit in terms of achievable top operating frequency, but it is also more difficult to control and estimate at design time~\cite{kuon_fpga_asic}. 
 FPGA logic is implemented in look-up tables (LUTs), which communicate with each other through a flexible but relatively slow interconnect. 
 While logic delays through LUTs are easily estimated because LUT timing characteristics are known at design time, routing delays depend on the relative placement of interconnected LUTs and the interconnect congestion, which are both unknown until place and route is completed. Distant interconnected LUTs increase route delays, while congestion along some interconnect segments may cause route delays to increase further, as some signals have to be diverted along less crowded but more distant segments.  As a result, path delays are naturally less balanced in FPGAs than in ASICs. One usual solution is to force path balancing during placement and routing, which tries to make slow paths faster at the expense of making fast ones slower.
@@ -19,7 +20,7 @@ While CM based ACM is capable of delivering speedup, when compared to the tradit
 technique, long clock switching delays significantly diminish the performance gains for some benchmarks. 
 In this paper we propose and evaluate a novel hybrid FPGA tailored ACM framework.
-First we propose and evaluate a FPGA tailored Clock Stretching (CS) implementation and compare its potential performance in terms of accuracy, dynamic range, and latency with the ones of the CM counterpart. Our evaluations indicate that (i) CM has good accuracy and large clock frequency dynamic range but it is rather slow, while (ii) CS has limited dynamic range but exhibit a low frequency switching latency.   Based on these we subsequently propose a hybrid ACM which combines the CS and CM advantages, while simultaneously hiding their drawbacks, by monitoring the processor instruction stream and determining which adaptive clock management strategy is optimal at any given time. We evaluate the effectiveness of the proposed ACM solution by simulating the execution of the Sum of Square Differences (SSD) algorithm, a neural network solver, and a FIR filter implemented on an ACM-augmented vector processor mapped on a Zynq FPGA. Our evaluations indicate that the proposed hybrid ACM enables an up to 14\% execution time decrease when compared with clock multiplexing ACM. Moreover the hybrid ACM implementation requires only 52 LUTs and 6 global clock buffers, and dissipates 100 mW while the CM ACM implementation only requires a single clock buffer.
+First, we propose and evaluate an FPGA-tailored Clock Stretching (CS) implementation and compare its potential performance in terms of accuracy, dynamic range, and latency with that of its CM counterpart. Our evaluations indicate that (i) CM has good accuracy and a large clock frequency dynamic range but is rather slow, while (ii) CS has a limited dynamic range but exhibits a low frequency switching latency. Based on these findings, we subsequently propose a hybrid ACM which combines the CS and CM advantages, while simultaneously hiding their drawbacks, by monitoring the processor instruction stream and determining which adaptive clock management strategy is optimal at any given time. We evaluate the effectiveness of the proposed ACM solution by simulating the execution of the Sum of Square Differences (SSD) algorithm, a neural network solver, and a FIR filter implemented on an ACM-augmented vector processor mapped on a Zynq FPGA. Our evaluations indicate that the proposed hybrid ACM enables an execution time decrease of up to $14$\% when compared with clock multiplexing ACM. Moreover, the hybrid ACM implementation requires only $52$ LUTs and $6$ global clock buffers, and dissipates $100$ mW, while the CM ACM implementation only requires a single clock buffer.
 The rest of this paper is structured as follows. 
 Section \ref{prev-work} describes Adaptive Clock Management, Clock Stretching, and Clock Multiplexing, and lists relevant related work. In Section \ref{cs_cm_theoretical} we construct theoretical performance models for Clock Stretching and Multiplexing based ACM, with the purpose of guiding the FPGA implementations of CS based ACM.