
Event Name: Live Chat - Make Your Parallel Code Run Faster (LPW068569)

File Saved by: Sharon Troia

Date Time: Wed November 8 2006 12:02:35

Sharon Troia: (10:35) Hi, welcome to AMD’s Multi-core text chat!

Sharon Troia: (10:35) My name is Sharon Troia. I’ll be your host for this event. I work at AMD as the managing editor for the tech articles that we provide on Dev Central, developer..

Sharon Troia: (10:35) We will get started at 11am, until then, hang tight :)

Sharon Troia: (10:50) Shortly before we start, we'll do a quick survey.

Sharon Troia: (10:54) You should all be seeing a survey along with the results.

Sharon Troia: (10:56) Wow, looks like we have a pretty even spread of experience here today so far.

Sharon Troia: (10:57) We'll be switching between the agenda slide and the survey for those people who are just joining.

Sharon Troia: (10:59) While people are joining, we are going to start out with a survey to find out how experienced you are with multithreading.

Sharon Troia: (10:59) Please hold your questions until the survey is complete and we get started.

Sharon Troia: (11:00) We are just about ready to get started. We will give participants one or two more minutes to get situated and then we will start the session.

Sharon Troia: (11:00) Please respond to the survey if you haven't already.

Sharon Troia: (11:02) Hi, welcome to AMD’s Multi-core text chat!

Sharon Troia: (11:02) My name is Sharon Troia. I’ll be your host for this event. I work at AMD as the managing editor for the tech articles that we provide on Dev Central, developer..

Sharon Troia: (11:02) I’d like to introduce you to our experts Mike Wall and Richard Finlayson.

Sharon Troia: (11:02) Mike Wall is a Sr. Optimization Engineer with experience working with software partners on their optimization and multithreading projects. He'll answer your technical questions about multi-core and multithreading and has some slides for you to reference during this chat session.

Sharon Troia: (11:03) Richard Finlayson is the Director of our Developer Outreach Team and is here to talk about Multicore resources available for you through our developer program on developer..

Sharon Troia: (11:03) One more thing I should mention before we get started - don't forget we will be conducting a drawing to give away an AMD dual-core workstation! Make sure you're registered with a valid email address; we'll do the drawing and notify the winner shortly after the session ends.

Sharon Troia: (11:03) Ok, let’s get started!

Sharon Troia: (11:03) It looks like the majority of people are pretty skilled. Take it away, Mike!

Mike Wall: (11:03) >> The main message is "multi-core is here to stay" and the trend is toward a larger number of cores. So developers who care about performance really need to design scalable multi-threaded code. Scalable multi-threading is the main take-away idea for this chat session.

Mike Wall: (11:04) >> OK, so why hasn't everyone already threaded their performance-critical code? Lay it on me...

Michael Lewis: (11:05) rewriting 300KLOC is not my idea of a fun weekend :-)

David Shust: (11:05) Who says I haven't?

Mike Wall: (11:05) Re: Who says I haven't?

hee hee

Douglas Campbell: (11:05) We have, we're looking for cheaper hardware to do what we use NUMA stuff for now

zirani jean-sylvestre: (11:05) the slides show we are going to have more and more cores; do you think developers should plan on writing apps running on > 8 cores?

Mike Wall: (11:06) you will likely see >8 cores in the coming years, yes

Fabrice Lété: (11:05) I am scared of debugging getting a lot harder

Frank Morales: (11:06) i agree w/ the debugging.

Veikko Eeva: (11:06) Writing multithreaded software with current tools is a rather difficult task. Especially since it's not at all clear how to find the performance-critical parts of the code that can easily be multithreaded.

Stephen Lacy: (11:06) where can I find documentation to answer architectural questions such as the mechanism for cache-to-cache communication and the latencies involved?

Richard Finlayson: (11:07) Re: We have, we're looking for cheaper hardware to do what we use NUMA stuff for now

>>What platform are you on? What platforms will best meet your needs, based on your NUMA comment?

Mike Wall: (11:07) all the major debuggers have some degree of threading support, but yes, it's harder. AMD CodeAnalyst supports some multi-thread perf analysis

zirani jean-sylvestre: (11:07) do you think (co)processors with, let's say, 512 or 1024 cores will be standard in > 4 years?

robert marshall: (11:07) I'm still wedded to tools like VTune to show me where I need parallelism; I go to that section and attack that area of the problem. I want better tools; better languages.

Douglas Campbell: (11:07) SGI. We've actually test-ported to an Athlon 64 dual-core and we are currently seeing better performance than on our SGI equipment.

Mike Wall: (11:08) yes, you need to profile and look for opportunities for data-parallel threading

Richard Finlayson: (11:08) Re: where can I find documentation to answer architectural questions such as the mechanism for cache-to-cache communication and the latencies involved?

>>Stephen, check out . AMD Developer Central resources include:

Documentation, developer tools, product information, AMD64 ecosystem information, technology features & benefits, and in-depth technical documents on all topics related to software development on AMD64.



Mike Wall: (11:09) >> Clean separation of data workloads is critical; don't pass a lot of data between threads. Also, as in single-threaded programming, try to read data only once from memory, and do all relevant processing while it's in cache. Work on small blocks if necessary.
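Mike's read-once, block-at-a-time advice can be sketched as follows; the block size, the float payload, and the two processing passes are hypothetical stand-ins, not code from the session:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Process the data in roughly cache-sized blocks: run every pass over a
// block while it is still resident in cache, instead of streaming the
// whole array through memory once per pass.
constexpr std::size_t BLOCK = 8 * 1024;  // 8K floats = 32 KB, an L1-sized guess

void process_blocked(std::vector<float>& data) {
    for (std::size_t base = 0; base < data.size(); base += BLOCK) {
        const std::size_t end = std::min(base + BLOCK, data.size());
        for (std::size_t i = base; i < end; ++i) data[i] *= 2.0f;  // pass 1
        for (std::size_t i = base; i < end; ++i) data[i] += 1.0f;  // pass 2
    }
}
```

Each element is read from memory once per block rather than once per pass, and the outer loop over blocks is also a natural unit for data-parallel threading.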

Sharon Troia: (11:09) The slides being shown are only for reference. This is a text based chat event with no audio. All questions will be answered through the text chat window.

David Shust: (11:09) If a cache line is written by one core, is that memory updated in the other cores' caches only if they access the same memory location?

Mike Wall: (11:10) see the slide

Mike Wall: (11:10) there is real sharing, when threads write the same location... and "false sharing"

Veikko Eeva: (11:09) It'd be nice to see how C++0x will implement its threading model. Does anyone have a comment regarding this? Like tool support with C++ and threads. :)

Mike Wall: (11:12) I don't understand the question, sorry

Michael Lewis: (11:09) Can you recommend any strategies for dealing with situations where the majority of operations in a data-parallel model can be done atomically, but some require dependencies?

Mike Wall: (11:09) >> When possible, use libraries that already implement multi-threading. Don't reinvent the wheel (unless you really want to!) Implement data-parallel multi-threading, for best scaling to N cores. See the slides for more details. Test on bleeding-edge hardware, i.e. 4x more cores than your current customers use. Testing and optimizing desktop software on a 2-socket or 4-socket workstation is a really good idea!

Douglas Campbell: (11:09) Shust's question is important -- cache coherency

Mike Wall: (11:12) yes, any time someone writes to a location, other caches get that line flushed if they are storing that data.

Richard Finlayson: (11:10) Re: where can I find documentation to answer architectural questions such as the mechanism for cache-to-cache communication and the latencies involved? Also, check out the Architecture Programmer's Manuals. A set is available to anyone who views and participates in the Multicore Video Roundtable. The Multicore Video Roundtable includes five video chapters, with a forum feature to allow you to discuss multicore issues. You will find this on the main page at AMD Developer Central

Douglas Campbell: (11:10) ach, the slides are coming too fast!

Stephen Lacy: (11:11) Richard, the programmers manuals I have downloaded have scant references to multi-core. But perhaps there's a set of manuals I have missed?

zirani jean-sylvestre: (11:12) Mike; malloc does contain "hidden" locking, doesn't it? This may have an impact on performance

Richard Finlayson: (11:12) Re: Richard, the programmers manuals I have downloaded have scant references to multi-core. But perhaps there's a set of manuals I have missed? Stephen, updates are published regularly. Multicore content will continue to be added. Updates can be expected at least twice a year.

Dick Dunbar: (11:12) So cache line is 64 bytes ?

Veikko Eeva: (11:12) Or is malloc basically a lock-free algorithm? I could imagine it could be implemented that way.

Douglas Campbell: (11:13) I note cache alignment to avoid cache thrashing -- is there anything to do to prevent cache invalidation between processors -- such as not aligning on a certain number of low order bits as in the SGI case?

paul cheyrou: (11:13) Where can I find an exhaustive, comprehensive "standard rulebook of threaded programming" that I can point other developers at? (My knowledge comes from lots of resources and personal experience.) In a team, making people use threaded programming requires a common basis... so a "standard" is indeed what I'm looking for...

Larry Wang: (11:13) I lost the front part of this chat, and it's kind of fast - can we have the whole session replayed somehow later after the chat?

Sharon Troia: (11:13) Re: I lost the front part of this chat, and it's kind of fast - can we have the whole session replayed somehow later after the chat?

>>We are going to post transcripts - and possibly even a playback of the slides.

Dick Dunbar: (11:13) malloc is system-dependent. AIX has a lock-free algorithm ... others do not.

Mike Wall: (11:14) the L1 cache is 2-way associative, so data that is addressed modulo-32K can only live in 2 places... avoid lots of "exactly 32k" blocks

Douglas Campbell: (11:14) Thanks Mike

Mike Wall: (11:15) this is a separate issue from coherency, though... two threads should avoid sharing the same 64-byte cache line whenever possible
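The false-sharing point above can be illustrated with padded per-thread slots; the 64-byte figure is the cache-line size under discussion, and the four-slot array is a hypothetical example:

```cpp
#include <cassert>

// Give each thread's counter its own 64-byte cache line, so two cores
// incrementing their own slots never ping-pong a shared line between
// their caches ("false sharing").
struct alignas(64) PaddedCounter {
    long value = 0;
    // alignas(64) pads each array element out to a full cache line
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

PaddedCounter counters[4];  // e.g. one slot per core
```

Without the alignas, all four longs would sit in one cache line and every increment would invalidate the other cores' copies of it.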

Stephen Lacy: (11:15) What is the latency for cache-to-cache communication? I.e., same data in both L1 caches, then one core does a write, and the other core does a read to get the new data.

Mike Wall: (11:17) I don't know the exact number, but it's longer than accessing your own core's cache, for sure

Chris MacGregor: (11:15) Mike, can you say more about that? ('so data that is addressed modulo-32K can only live in 2 places... avoid lots of "exactly 32k" blocks')

David Shust: (11:15) If I write 3 sets of 3 MMX registers (temp vars), each set contiguously, starting on 256-byte boundaries, but skip the 4th 64-byte value of each set, meaning I only write at the beginning of cache lines but do not fully write the cache lines, do they never go into first-level cache? Can these "variables" actually just remain in the processor proper, in the cache-line-assembling hardware? I only need 3 cache lines for this. All my actual memory output is streaming, bypassing cache.

Sharon Troia: (11:15) Re: re: developer.: I have had a very hard time finding detailed info on the different processors available. Particularly, as I spec systems that will be used for scientific computing, I want to compare things like cache (L1, L2, per-core, total, etc.), cores, core interconnect, clock speed, FSB, etc., but it's a huge pain (if possible at all) to find the info I want. Someone go kick the marketing people - I can't get 'em to buy AMD instead of Intel if they get a clear (misleading or not) story from Intel but confusion from AMD.

>> Check out the Opteron Comparison website; this should give you the info you are looking for

Mike Wall: (11:16) look at the optimization guide, on developer. for more info about this

Chris MacGregor: (11:16) Sharon: does the opteron comparison cover the Athlon X2, FX-nn, etc. processors?

Chris MacGregor: (11:16) part of the problem was the difficulty in comparing Athlon XP (at the time) to FX-nn to Opteron, etc.

Sharon Troia: (11:17) Re: Sharon: does the opteron comparison cover the Athlon X2, FX-nn, etc. processors?

>>Yes, you can compare dual-core with Opteron

Miguel Pedro: (11:17) What strategy do you recommend for best performance with N threads: let the OS take care of scheduling each thread to a specific core dynamically, or should I set the thread affinity myself?

Mike Wall: (11:19) see the slide, you can gain perf in memory-intensive apps using NUMA if you're careful

Veikko Eeva: (11:17) This is something I'd like to know also.

Mike Wall: (11:19) Check the slide, allocating local memory can be done
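For Miguel's affinity question, a Linux-only sketch (Windows would use SetThreadAffinityMask): pin the thread, then allocate and touch its data from that thread, so first-touch page placement puts the memory on the local NUMA node. The helper name is ours, not an AMD API:

```cpp
#include <sched.h>  // Linux: sched_setaffinity, cpu_set_t

// Pin the calling thread to one CPU. Memory that the thread then
// allocates and first touches is placed on that CPU's local NUMA node
// by the OS's first-touch policy.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = this thread
}
```

Letting the OS schedule freely is the simpler default; explicit pinning pays off mainly for memory-intensive, NUMA-sensitive workloads.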

Mike Wall: (11:18) cache lines are always treated as an "atomic" unit. Streaming stores avoid allocating any cache line.
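Mike's streaming-store remark, as an x86 SSE2 sketch; the function and fill pattern are hypothetical:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: _mm_stream_si32, _mm_sfence

// Write results straight to memory with non-temporal stores, so the
// output never allocates (or evicts) cache lines on its way out.
// Fence before any other thread is allowed to read the buffer.
void stream_fill(int* dst, int n, int value) {
    for (int i = 0; i < n; ++i)
        _mm_stream_si32(dst + i, value);  // bypasses the cache
    _mm_sfence();                         // make the stores globally visible
}
```

This is the pattern David describes for his fractal output: the computed results stream to memory without polluting the working set in cache.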

Richard Finlayson: (11:18) While Mike answers Miguel's and Veikko's questions, allow me to highlight AMD's Developer Central resources. AMD Developer Central is AMD’s online resource to support and engage software developers of all interests on AMD64. The “jewels in the crown,” so to speak, are the depth of technical documents on a wide variety of development-related issues as well as free downloadable tools that enhance performance and allow for straightforward optimization of your applications.

Douglas Campbell: (11:19) Of course, us *nix folk are thrilled by these slides...

Mike Wall: (11:19) sorry 8-/

Mike Wall: (11:20) the *nix OSes support NUMA also

Michael Lewis: (11:20) in my situation I have a lot of data points which are all updated in discrete time steps; usually each point can be updated without accessing any other data point, but there are cases where a point may have to look at its "neighbors" to be updated in a step. Can you recommend any algorithmic (or low-level) strategies for minimizing the cost of these dependent updates?

Stephen Lacy: (11:20) I've had a very difficult time finding latency and performance numbers in AMD's docs. This is a big barrier to investigating the use of multithreading -- you can't construct a good parallelization strategy without knowing the communication overheads. Do the Developer Central resources have concrete numbers on questions such as these (e.g., cache-to-cache latency)?

Veikko Eeva: (11:20) Does virtualization have a significant impact on writing multicore aware code?

Sharon Troia: (11:23) Re: Does virtualization have a significant impact on writing multicore aware code?

>>As long as you use the OS methods to determine how many cores are present, writing multicore code should be fine on a virtualized machine

Richard Finlayson: (11:20) You can give us feedback on Dev Central resources via this email address: dev.central@

Douglas Campbell: (11:20) Mike, of course -- where do you think it all started?

Mike Wall: (11:21) can you double-buffer your entire data set, so it's "read only" ?

Dick Dunbar: (11:21) Are there "touch" instructions that allow committed memory, without making the cacheline dirty?

Mike Wall: (11:21) indeed

Mike Wall: (11:21) no

Michael Lewis: (11:21) would be prohibitive - we're talking millions of data points, i.e. close to the 4GB practical addressing limits

Mike Wall: (11:22) work a chunk at a time, special case for overlapping areas?

David Shust: (11:21) Mike: I understand that when I dump my results I will not be using cache lines. So my question is, if I'm just continually using 3 sets of 3 contiguous 64-byte values, will these data just stay in the cache lines? Or do they get written and read to the cache? This is for a real-time fractal program that recalculates about 20 million triangles per second, per core.

Mike Wall: (11:22) they stay in the cache lines

Dick Dunbar: (11:22) Mike: Was that "no" directed to my "touch" question?

Mike Wall: (11:23) yes, sorry

Michael Lewis: (11:22) sure, but the catch is it is not really feasible to predict where overlaps will occur (hindsight of course is easy)

Larry Wang: (11:23) It looks like I am a beginner at all these multi-core techs, though recently I purchased two dual-core AMD 64-bit Opteron workstations. I would like to find an introduction to parallel processing architectures/techniques so I may later fully explore their capabilities.

Miguel Pedro: (11:23) Thanks Mike

Michael Lewis: (11:23) I realize the case is a bit vague but I'm casting around for some kind of lock-free method that can chunk through the main cases, and then queue the collisions for special processing later; but this may be a language-level problem rather than architecture (i.e. backing out when we detect collision is not expressible in C++)
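One shape for Michael's idea, sketched with hypothetical update rules: an independent main pass (the part that parallelizes cleanly) plus a serial second pass over the deferred, neighbor-dependent points:

```cpp
#include <cstddef>
#include <vector>

// Stand-in predicate: in a real app this would detect points whose
// update depends on neighbors (a potential collision).
bool needs_neighbors(std::size_t i) { return i % 10 == 0; }

void step(std::vector<double>& pts) {
    std::vector<std::size_t> deferred;  // one list per thread in real use
    for (std::size_t i = 0; i < pts.size(); ++i) {   // lock-free main pass
        if (needs_neighbors(i)) { deferred.push_back(i); continue; }
        pts[i] += 1.0;                               // independent update
    }
    for (std::size_t i : deferred)                   // serial cleanup pass
        pts[i] = (i > 0 ? pts[i - 1] : 0.0);         // hypothetical neighbor rule
}
```

The main pass can be split across cores with no locks; only the (hopefully small) deferred list needs the dependent, order-sensitive handling.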

Mike Wall: (11:25) we don't have any special HW help for you

robert marshall: (11:24) I design commodity clusters for HPC scientific Grand Challenge apps such as weather forecasting. Multi-cores mean cheaper clusters, but my users are used to thinking in terms of FORTRAN loops to match cache strides and having MPI abstract away their inter-processor communications. The jump to threading is an architectural leap that I will somehow have to solve for them unless new tools or libraries or language extensions, i.e. to FORTRAN, come about.

Mike Wall: (11:26) OpenMP can help

Mike Wall: (11:24) I don't know enough about your app to answer

Sharon Troia: (11:25) Robin Maffeo has also joined us as a multithreading expert

Richard Finlayson: (11:25) Re: It looks like I am a beginner at all these multi-core techs, though recently I purchased two dual-core AMD 64-bit Opteron workstations. I would like to find an introduction to parallel processing architectures/techniques so I may later fully explore their capabilities.

>>Larry, start out with the MultiCore Resource Center, then go from there. The Multicore Resource Center includes articles, related links, and multicore development related tools.



Also, check out the Multicore Video Roundtable, which includes five video chapters, with a forum feature to allow you to discuss multicore issues. You will find this on the main page at AMD Developer Central

Sharon Troia: (11:25) Thanks Robin! Please feel free to start answering questions

Dick Dunbar: (11:26) Robert: The "search for parallel content" is handled by the compilers .... yes? And threading is handled under the covers by the Fortran pragmas. Maybe I didn't understand your question ... it should be happening automagically for your users moving code from one platform to AMD. Of course, the AMD optimization may come a little late.

Mike Wall: (11:27) OpenMP is supported in Fortran- automatic threading of heavy data loops
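The kind of OpenMP data-parallel loop being discussed, shown here in C++ (equivalent directives exist for Fortran). Built with -fopenmp the iterations are split across cores; without it the pragma is ignored and the loop runs serially with the same result:

```cpp
#include <vector>

// Sum of squares with the loop iterations divided among the cores.
// reduction(+:total) gives each thread a private partial sum and
// combines them at the end, avoiding a shared-counter bottleneck.
double sum_squares(const std::vector<double>& v) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        total += v[i] * v[i];
    return total;
}
```

This is the "automatic threading of heavy data loops" pattern: one directive, no explicit thread management.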

David Shust: (11:27) I'd like to mention that I see a 98% speed increase in my fractal app using two cores vs one on an X2 4800. I had thought my calculations were memory bound. But double the speed with a shared memory controller must mean otherwise. Doesn't it?

Mike Wall: (11:27) you must be running in cache- you have separate cache per core

Róbert Kovács: (11:27) Do you think that OpenMP can be useful in game development, where the chunks would run in 1-2 ms at maximum?

Mike Wall: (11:28) overhead may be too much

Stephen Lacy: (11:27) Mike, can you briefly contrast AMD's multicore architecture to Intel's new "Core" architecture? What advantages does AMD's multicore solution offer over Intel's?

Larry Wang: (11:27) Thanks, Richard, I will. How do I copy the information from the chat text instead of writing it down by hand?

Sharon Troia: (11:28) Re: Thanks, Richard, I will. How do I copy the information from the chat text instead of writing it down by hand?

>> we will be posting these transcripts online in a day or two

robert marshall: (11:28) If the users can understand the GCC optimization switches and move their code from F77 to F90 to F95 to F2003, they have a prayer of getting the optimizations; but the threading support's not there in GCC yet.

David Chait: (11:28) I'm seeing these slides as related to game development... but not sure of the 'direction' to take? Run a bunch of similar processing threads simultaneously? Try to break up the distinct functions and mutex/bottleneck at transition points? OpenMP??

Mike Wall: (11:29) Re: I'm seeing these slides as related to game development... but not sure of the 'direction' to take? Run a bunch of similar processing threads simultaneously? Try to break up the distinct functions and mutex/bottleneck at transition points? OpenMP??

>> Functional threading (or "task-parallel") can be a relatively simple start, but it doesn't scale up very far (see slides). You really need some form of data-parallel threading to scale to 4, 8, 16 core processors and beyond... which you will probably see during the lifespan of applications currently under development.
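A hand-rolled version of the data-parallel pattern Mike recommends, using modern C++ std::thread for brevity (in 2006 this would be CreateThread or pthreads). Each worker gets one contiguous chunk, so there are no shared writes and no locks; the function and chunking are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split the index range into one contiguous chunk per hardware thread;
// each core streams through its own region of the array, so the code
// scales with the core count instead of with a fixed task list.
void scale_all(std::vector<float>& data, float factor) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;  // fallback if the core count is unknown
    const std::size_t chunk = (data.size() + n - 1) / n;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        const std::size_t lo = t * chunk;
        const std::size_t hi = std::min(lo + chunk, data.size());
        if (lo >= hi) break;
        pool.emplace_back([&data, factor, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) data[i] *= factor;
        });
    }
    for (auto& th : pool) th.join();  // fork/join barrier
}
```

Contrast this with functional threading, where the thread count is fixed by the number of distinct tasks and cannot grow with the hardware.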

Tomasz Ciamulski: (11:29) Memory-intensive apps are very common, but don't get good performance just from more cores. How do you see NUMA arriving in contemporary PCs, so that NUMA machines become massively attainable? AMD has good NUMA solutions, but they're used mainly in servers - so far.

Mike Wall: (11:31) Re: Memory-intensive apps are very common, but don't get good performance just from more cores. How do you see NUMA arriving in contemporary PCs, so that NUMA machines become massively attainable? AMD has good NUMA solutions, but they're used mainly in servers - so far.

>>NUMA is coming to the desktop - we are launching 2-socket desktop machines, code name "4x4" this month

David Shust: (11:29) Mike: but each core is doing cache-bypassing writes to memory, generating huge amounts of data. The shared memory controller somehow keeps up.

Mike Wall: (11:30) Re: Mike: but each core is doing cache-bypassing writes to memory, generating huge amounts of data. The shared memory controller somehow keeps up.

>> nice to hear 8-)

Scott Paine: (11:29) Re: not there in GCC yet-- It's worth mentioning that RedHat has back-ported OpenMP into gcc4.1 on Fedora-- works great.

Mike Wall: (11:31) Re: Re: not there in GCC yet-- It's worth mentioning that RedHat has back-ported OpenMP into gcc4.1 on Fedora-- works great.

>> thanks, Scott!

Dick Dunbar: (11:30) Larry Wang: Your shiny new 64-bit AMD box - what OS are you using? If Windows, make sure you use at least WinXP64 ... Windows "JUMPS" to attention on the AMD boxes. Running things on VMware with Opteron, I see superscalar performance running 32-bit Win/Solaris/Linux on the Opterons. Nothing close to that using the Intel hyperthreaded chips.

David Chait: (11:31) So, for data-parallel, is the best approach OpenMP/compiler, or create a ton of 'worker threads' that can do general work, or a set of threads per functional area and round-robin through them from the 'primary' process thread?

Mike Wall: (11:32) Re: So, for data-parallel, is the best approach OpenMP/compiler, or create a ton of 'worker threads' that can do general work, or a set of threads per functional area and round-robin through them from the 'primary' process thread?

>> Compilers that support OpenMP can help you implement data-parallel threading. Visual Studio 2005, various gcc versions. Java implements its own multi-threading, behind the scenes.

Mike Wall: (11:32) Re: So, for data-parallel, is the best approach OpenMP/compiler, or create a ton of 'worker threads' that can do general work, or a set of threads per functional area and round-robin through them from the 'primary' process thread?

OpenMP makes its own thread pool

Philip Borghesani: (11:31) One problem we have with OpenMP is dynamic problem scaling and deciding at what data size it is appropriate to thread the calculation and when we are better off on one processor. Any comments?

Mike Wall: (11:33) Re: One problem we have with OpenMP is dynamic problem scaling and deciding at what data size it is appropriate to thread the calculation and when we are better off on one processor. Any comments?

Good question. It's gotta be empirical testing, I think.

Douglas Campbell: (11:31) Mike's point about data being the lens is essential to proper multi-threading -- you protect data, not code.

Robin Maffeo: (11:32) Re: Mike, can you briefly contrast AMD's multicore architecture to Intel's new "Core" architecture? What advantages does AMD's multicore solution offer over Intels?

>> AMD's multi-core architecture coming next year is a true, native dual-core design. Intel's 4-core design throws two chips on a package, interconnected at the front-side bus. This can be a drawback since some applications will saturate the bus. In addition, the cores are not symmetric (each pair of cores shares a cache)

Róbert Kovács: (11:32) What is the main direction of your developments? Should developers expect symmetric core architectures or asymmetric in the next few years?

Robin Maffeo: (11:33) Re: What is the main direction of your developments? Should developers expect symmetric core architectures or asymmetric in the next few years?

>> symmetric in the near/medium term

Stephen Lacy: (11:32) Given that AMD's memory controller is dual-channel, can one assign an off-chip memory channel to each core? Or must the two channels be used together as a single wide channel?

Mike Wall: (11:34) Re: Given that AMD's memory controller is dual-channel, can one assign an off-chip memory channel to each core? Or must the two channels be used together as a single wide channel?

They work in sync, you have no control

Dick Dunbar: (11:33) "4x4" ... come with a towing package and snow tires? :-)

Mike Wall: (11:34) Re: "4x4" ... come with a towing package and snow tires? :-)

>>and some chewing tobacco

Miguel Pedro: (11:34) Practical question: for large memcpys, is there any advantage in splitting the buffer size and using multiple cores? If so, how large would the buffer have to be to offset the overhead to fork/join the threads?

Mike Wall: (11:34) Re: Practical question: for large memcpys, is there any advantage in splitting the buffer size and using multiple cores? If so, how large would the buffer have to be to offset the overhead to fork/join the threads?

>>no, just use an optimized copy on a single core

Douglas Campbell: (11:34) Robin, doesn't the OS make the cores asymmetrical? I note that one of my CPUs handles all the interrupts under RHEL4

Robin Maffeo: (11:35) Re: Robin, doesn't the OS make the cores asymmetrical? I note that one of my CPUs handles all the interrupts under RHEL4

>> ah. from that perspective yes. Depending on OS version interrupt routines will be bound to a particular core.

David Shust: (11:35) OpenMP is slightly slower than hand-coded thread pools, but it's easier, if you cannot endure writing the threading/exclusion code.

Mike Wall: (11:35) Re: OpenMP is slightly slower than hand-coded thread pools, but it's easier, if you cannot endure writing the threading/exclusion code.

yup!

Mike Wall: (11:35) >> When possible, use libraries that already implement multi-threading. Don't reinvent the wheel (unless you really want to!) Implement data-parallel multi-threading, for best scaling to N cores. See the slides for more details. Test on bleeding-edge hardware, i.e. 4x more cores than your current customers use. Testing and optimizing desktop software on a 2-socket or 4-socket workstation is a really good idea!

Philip Borghesani: (11:35) And I can ship a product for all platforms developed with empirical testing? We get widely different cutoffs for AMD and Intel platforms

Richard Finlayson: (11:35) Check out free downloadable Software Development tools at



Tools include the AMD CodeAnalyst™ Performance Analyzer, the AMD Core Math Library (ACML), and the AMD SimNow simulator.

Andre Nogueira: (11:35) Which of these do you think will win in the future (ie which one will be used by the average joe) after weighing performance, cost, and energy consumption - machines with one core and one processor, multi-processor machines, one processor with several cores, or several processors with several cores each? And why?

Richard Finlayson: (11:36) The AMD CodeAnalyst™ Performance Analyzer is a suite of powerful tools that analyzes software performance on AMD microprocessors. These tools are designed to support Microsoft® Windows® 2000 or Microsoft Windows XP® distributions on x86 and AMD64 architectures as well as both 32-bit and 64-bit Linux distributions based around the 2.4 or 2.6 kernel series on x86 based architecture. Although most users will choose the Graphical User Interface, the profiler is also offered as a command line utility to facilitate the use in batch files.

Mike Wall: (11:36) Re: And I can ship a product for all platforms developed with empirical testing? We get widely different cutoffs for AMD and Intel platforms

>>yah, that is an issue... can you implement a fast benchmark at start-up?
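Mike's start-up benchmark idea, in outline: time a sample workload once per machine and use the result to pick serial-vs-threaded cutoffs at run time. The workload and sizes here are placeholders:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Time one serial pass over an n-element buffer and return ns/element.
// An app would run this (and a threaded variant) once at start-up and
// cache the crossover size for this particular machine, instead of
// shipping one hard-coded cutoff for both AMD and Intel platforms.
double ns_per_element(std::size_t n) {
    std::vector<double> buf(n, 1.0);
    const auto t0 = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (double x : buf) sum += x * 1.000001;
    const auto t1 = std::chrono::steady_clock::now();
    volatile double sink = sum;  // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(n);
}
```

Comparing the serial and threaded numbers at a few buffer sizes gives a per-machine threshold below which the fork/join overhead outweighs the parallel speedup.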

Dick Dunbar: (11:36) Philip: data size determination ... Yup ... that's why it's an art, not a science :-) Alan Karp had a running "bet" with the industry to develop apps that would provide positive scaling. He went a long time without having to pay up. Eventually, we got some Fortran vector/parallel chemistry apps to actually go super-scalar. It's been awhile since I was in that game.

Richard Finlayson: (11:36) The AMD Core Math Library (ACML) incorporates BLAS, LAPACK and FFT routines, as well as vector math libraries, which are designed to be used by a wide range of software developers to obtain excellent performance from their applications running on AMD platforms. The highly optimized library contains numeric functions for mathematical, engineering, scientific and financial applications. ACML is available both as a 32-bit library, for compatibility with legacy x86 applications, and as a 64-bit library that is designed to fully exploit the large memory space and improved performance offered by the latest AMD64 architectures. ACML is supported on both Linux and Microsoft® Windows® operating systems.

Douglas Campbell: (11:36) Mike, the o'reilly pthreads book is a great starter for thread pooling code -- but there is a bug in their implementation, read the errata.

Mike Wall: (11:37) Re: Mike, the o'reilly pthreads book is a great starter for thread pooling code -- but there is a bug in their implementation, read the errata.

Thanks for the reference!

Mike Wall: (11:36) Re: Which of these do you think will win in the future (ie which one will be used by the average joe) after weighing performance, cost, and energy consumption - machines with one core and one processor, multi-processor machines, one processor with several cores, or several processors with several cores each? And why?

>>single-core processors are going away. So it will be multi-core, with multi-socket at the high end...

David Shust: (11:37) For absolute tweaking, I found CodeAnalyst to be deficient. That or my routines are beyond improvement. Wishful thinking, that.

Robin Maffeo: (11:37) Re: For absolute tweaking, I found CodeAnalyst to be deficient. That or my routines are beyond improvement. Wishful thinking, that.

>> can you elaborate?

Sharon Troia: (11:37) Next survey coming up...

Larry Wang: (11:37) Dick, I am running Linux Fedora on my Opteron but haven't utilized any 64-bit features for my gigantic physics code calculations - I even have (probably compatibility) problems between the OS and the FORTRAN compilers (f77 through f95 versions), and need help out of this hassle, like linking the FX debugger and the IMSL library with the compilers.

Philip Borghesani: (11:37) Not realistically; we are talking about MATLAB

Stephen Lacy: (11:38) Does AMD multicore provide any kind of core-to-core communication mechanism other than shared memory (i.e., communication through the caches)? I am thinking of some kind of direct messaging that does not involve the coherence protocol or shared memory.

Mike Wall: (11:38) Re: Does AMD multicore provide any kind of core-to-core communication mechanism other than shared memory (i.e., communication through the caches)? I am thinking of some kind of direct messaging that does not involve the coherence protocol or shared memory.

>>coherence is always enforced, except (temporarily) when using streaming store

Mike Wall: (11:39) there is a data path on-chip for passing data between caches

Dick Dunbar: (11:39) > 4x more cores than your customers use .... that is REALLY sensible advice. Our developers routinely use uniproc or 2-way Win boxes, and cannot understand how their code might fail in a 16-32 way box.

Mike Wall: (11:40) Re: > 4x more cores than your customers use .... that is REALLY sensible advice. Our developers routinely use uniproc or 2-way Win boxes, and cannot understand how their code might fail in a 16-32 way box.

>>future-proofing today means aggressive multi-core testing... more cores are coming soon!

Róbert Kovács: (11:40) data path: which API supports it on Windows?

Mike Wall: (11:41) Re: data path: which API supports it on Windows?

>> it is used automatically by the CPU to maintain coherency

David Chait: (11:40) It would seem that especially where (most) game developers have to span a wide range of cpu's, single and multi core, multiple OEMs, reliance on OpenMP to 'get things right' is dangerous, vs home-grown solution. At the same time, if OpenMP is supported on OSX+linux (gcc) and Windows (gcc, VS05), it makes for an easier 'jump start' into parallel processing. Any advice of how to learn, how to test different approaches?

Mike Wall: (11:42) Re: It would seem that especially where (most) game developers have to span a wide range of cpu's, single and multi core, multiple OEMs, reliance on OpenMP to 'get things right' is dangerous, vs home-grown solution. At the same time, if OpenMP is supported on OSX+linux (gcc) and Windows (gcc, VS05), it makes for an easier 'jump start' into parallel processing. Any advice of how to learn, how to test different approaches?

>> OpenMP is good. MSDN has some simple examples. Use OpenMP at a coarse granularity so the overhead is tolerable.

Larry Gewax: (11:41) How many cores do you see on a chip by 2011?

Mike Wall: (11:42) Re: How many cores do you see on a chip by 2011?

I dunno.... more than 4, that's-a-for-sure my friend

Joseph Osiecki: (11:41) So we've been begging for multi core/processors for 20+ years and now you are going to give it to us with a vengeance.

Mike Wall: (11:43) Re: So we've been begging for multi core/processors for 20+ years and now you are going to give it to us with a vengeance.

>>yup. it's too hard to extract more single-thread performance, so it's all about multi-core going forward

Róbert Kovács: (11:42) and can it be as fast as if I optimized for it "by hand"?

Mike Wall: (11:44) Re: and can it be as fast as if I optimized for it "by hand"?

>>there is no software visibility of the cache data sharing hardware

zirani jean-sylvestre: (11:42) If I am developing an application to be shipped in 3 years, does it make sense to optimise it for > 8 cores?

Robin Maffeo: (11:42) Re: If I am developing an application to be shipped in 3 years, does it make sense to optimise it for > 8 cores?

>> absolutely, yes.

Veikko Eeva: (11:42) Then there's something like Boost.MPI also...

Dick Dunbar: (11:42) > Multi-core, multi-socket .... I had "sticker shock" at the cost of ECC memory for the 2-slot Opterons, compared to the Athlon dual core. Can you talk about the memory requirements of current products ... and how you see that changing? i.e., can we use "commodity" memory?

Mike Wall: (11:44) Re: > Multi-core, multi-socket .... I had "sticker shock" at the cost of ECC memory for the 2-slot Opterons, compared to the Athlon dual core. Can you talk about the memory requirements of current products ... and how you see that changing? i.e., can we use "commodity" memory?

>>the new "4x4" platforms support commodity memory on a 2-socket machine

David Shust: (11:42) If you can correctly code multithreading via semaphores, atomic increment/decrement, mutual exclusion, etc., and your code actually works, I feel that will provide a better understanding of multithreading/multicore than would a hopeful dependence on OpenMP. Plus, you still have to lay out your code to be multithreadable via OpenMP.

Mike Wall: (11:45) Re: If you can correctly code multithreading via semaphores, atomic increment/decrement, mutual exclusion, etc., and your code actually works, I feel that will provide a better understanding of multithreading/multicore than would a hopeful dependence on OpenMP. Plus, you still have to lay out your code to be multithreadable via OpenMP.

>>very true

Larry Gewax: (11:42) Intel says 80.

Michael Lewis: (11:43) Does anyone have some favourite resources for lock-free/lock-minimal architecture design? Internet is preferable but I'm not against dead-tree if it's good

Robin Maffeo: (11:45) Re: Does anyone have some favourite resources for lock-free/lock-minimal architecture design? Internet is preferable but I'm not against dead-tree if it's good

>> look for lock free on wikipedia -- it has some links. This is hard to get right, though.

Frank Greco: (11:43) What about Java developers?

Douglas Campbell: (11:44) Larry, why would Intel not state core counts in powers of two? Sounds like weird architecture.

Frank Morales: (11:44) being new to 'parallel programming' are there tools available to 'simulate' multi-core machines, for those of us who don't have access to a physical box? Or who want to write code that scales well w/ N-cores\CPUs.

Dick Dunbar: (11:44) > Larry: Not utilized 64bit features. It is important to get the OS running 64-bit, even if you don't exploit the memory with 64-bit apps. The AMD chips are MUCH faster in 64-bit because of the extra registers and such. I was pleasantly surprised. Usually, doubling the data widths causes slow-downs ... not on modern hardware.

Veikko Eeva: (11:44) Michael, I don't think there are that many resources on lock-free algorithms and such. My feeling is that every lock-free algorithm is still worth a publication. :)

Douglas Campbell: (11:45) Dick, I got a 20% improvement on my 32-bit algorithms under gcc just by compiling for 64-bit as opposed to 32-bit.

Stephen Lacy: (11:45) Mike, does probing of caches to enforce coherence create a lot of contention between the probe and accesses from the cache's owning processor? I assume both the L2 and L1 are probed simultaneously on one core when the other core does a write to shared memory?

Michael Lewis: (11:45) thanks Veikko, Robin

Andre Nogueira: (11:46) For those who are experienced "multi-core developers", what advice would you give to those just now starting to look into the area (like me), and what were your biggest problems when you started developing for multi-core?

Mike Wall: (11:48) Re: For those who are experienced "multi-core developers", what advice would you give to those just now starting to look into the area (like me), and what were your biggest problems when you started developing for multi-core?

>>look at the new developer. multi-core resource center, for starters

Dick Dunbar: (11:46) > Radio box survey: I agree. I had to check Sun compiler to acknowledge how good they are. For corporate use, I'm saddled with Visual Studio

Douglas Campbell: (11:46) Robin, look at the postgresql site and examine their tree traversal algorithms; they have to minimize lock contention between multiple readers and writers.

Robin Maffeo: (11:47) Re: Robin, look at the postgresql site and examine their tree traversal algorithms; they have to minimize lock contention between multiple readers and writers.

>> I'll check it out; I'm assuming they have some sort of reader writer lock with compare/swap

Dick Dunbar: (11:47) > Our primary platform ... is CROSS platform. Commercial software has to hedge their bets.

Mike Wall: (11:47) Re: Mike, does probing of caches to enforce coherence create a lot of contention between the probe and accesses from the cache's owning processor? I assume both the L2 and L1 are probed simultaneously on one core when the other core does a write to shared memory?

>>I don't have details on the timing, but it's done quite efficiently on chip

Kim Rowe: (11:47) what about emulator support - and how will it change as the number of processors on a core increases?

Sharon Troia: (11:47) For everyone who doesn't have access to multi-core or 64-bit systems, please check out our Developer Center. The AMD Developer Center offers a fully configured AMD64 technology-based development environment to develop, test and optimize your applications. Located at AMD headquarters in Sunnyvale, California, the AMD Developer Center provides both on-site technical support as well as global virtual access to a 64-bit development environment with a high level of security.

David Shust: (11:47) Andre, buy a senior year Operating Systems text at your local university book store. It will explain what you need to know.

Larry Gewax: (11:47) If I use OpenMP will it optimize the thread count to the system the app runs on?

Robin Maffeo: (11:49) Re: If I use OpenMP will it optimize the thread count to the system the app runs on?

>> It depends on the OpenMP runtime implementation. On Windows, it will use the OS threadpool and will attempt to scale to the number of processors

Veikko Eeva: (11:48) Andre, there's a lot of information in academia, but it concerns bisimilarity, temporal logic, Turing machines, etc., and less how to make that work practically. Of course, there are some books on threading, but the resources are still somewhat scarce. Also, there's no real language support in the most commonly used languages (unlike in Erlang etc.)

Stephen Lacy: (11:48) Mike, did you ever work for Synaptics? (sorry for the brief excursion off topic)

Mike Wall: (11:49) Re: Mike, did you ever work for Synaptics? (sorry for the brief excursion off topic)

>>yup

Douglas Campbell: (11:48) Andre, my biggest problem was shifting from code-centric thinking to data centric thinking. As Mike pointed out before, all those mutexes/mutices and stuff protect data, not code.

Joseph Osiecki: (11:48) We're probably going to top out at 2 heavy threads and a light thread. At least for a while. It's easy to say "parallel data" but that's not the way most people think. Does that mean that the other cores will just sit there? Is that why you have independent power for each core?

Mike Wall: (11:50) Re: We're probably going to top out at 2 heavy threads and a light thread. At least for a while. It's easy to say "parallel data" but that's not the way most people think. Does that mean that the other cores will just sit there? Is that why you have independent power for each core?

>>cores will be idle if you cannot give them work. Power management will reduce their power usage somewhat

Andre Nogueira: (11:49) Ok, thank you for the answers guys! :)

Fabrice Lété: (11:49) do you recommend some strategy for stress-testing multithreaded applications for thread safety?

Mike Wall: (11:51) Re: do you recommend some strategy for stress-testing multithreaded applications for thread safety?

>>just try all the workloads and system configs you can think of. And for Windows, you can use the /numproc=n trick in boot.ini

Robin Maffeo: (11:51) Re: Re: do you recommend some strategy for stress-testing multithreaded applications for thread safety?

just try all the workloads and system configs you can think of. And for Windows, you can use the /numproc=n trick in boot.ini

>> also use the largest, fastest machine you can find. Sometimes race conditions won't appear on some platforms, so try to vary your testing.

Miguel Pedro: (11:49) Sharon, what are the costs to use global virtual access?

Sharon Troia: (11:51) Re: Sharon, what are the costs to use global virtual access?

>> It is free

Sharon Troia: (11:50) With only a few minutes left, we will start wrapping up the existing questions and can take one or two more. If you have questions after this session you can submit them to dev.central@.

Dick Dunbar: (11:50) > OpenMP -vs- HomeGrown. Good question. The important thing is to write your code with parallel constructs ... make the parallel content impossible to miss. Build on OpenMP, check your code, tweak, lather, repeat. Of course, some languages make it very easy to code parallel thoughts directly.

Mike Wall: (11:51) Re: > OpenMP -vs- HomeGrown. Good question. The important thing is to write your code with parallel constructs ... make the parallel content impossible to miss. Build on OpenMP, check your code, tweak, lather, repeat. Of course, some languages make it very easy to code parallel thoughts directly.

>>design from day#1 for parallel algorithms and try to minimize dependencies

Douglas Campbell: (11:51) Fabrice, to "fuzz test" your threads, you need to have a lot of interlocked data moving through them simultaneously.

Larry Gewax: (11:52) What are the main differences between coding for a multi-core system vs. a multiprocessor system? vs. say a multi-core multiprocessor system?

Veikko Eeva: (11:52) A good question.

Mike Wall: (11:52) >> AMD is co-developing high-performance tools with PGI. Version 6.2 is available now; version 7.0 will arrive in Q4. Here are some key features the tools have in support of multi-thread/core/processor development. Compiler features: highly optimized code for AMD64-based systems, auto-vectorization for SIMD code generation, auto-parallelization for multi-core and multiprocessor systems, OpenMP for directive-driven threading/parallelization of code, MPI for applications that take advantage of distributed computing platforms like clusters, and Linux and Windows support. The debugger and profiler: native thread debugging, OpenMP and MPI debugging/profiling, Linux and Windows.

Dick Dunbar: (11:53) Scale Up ... or Scale Out. Would you talk a bit more about the AMD Numa. I wasn't aware it was there. We've already seen the light at the end of the tunnel (speed of light), so it's pretty obvious if we want to solve WorldClass problems, we have to scale OUT. Message passing, Numa, not cache memory. Those disciplines will be rewarded by the AMD processor pipelines. Where can I find the AMD Numa story?

Sharon Troia: (11:53) One last survey while we are wrapping up here...

Kim Rowe: (11:53) Are there good tools to allow cost-based scheduling to allocate threads to processors? If so, with what restrictions?

Fabrice Lété: (11:53) any trick to force frequent random context switches for testing ?

Mike Wall: (11:54) Re: Scale Up ... or Scale Out. Would you talk a bit more about the AMD Numa. I wasn't aware it was there. We've already seen the light at the end of the tunnel (speed of light), so it's pretty obvious if we want to solve WorldClass problems, we have to scale OUT. Message passing, Numa, not cache memory. Those disciplines will be rewarded by the AMD processor pipelines. Where can I find the AMD Numa story?

>>Each CPU socket has its own memory controller, so as you add sockets to the system... you add memory bandwidth

zirani jean-sylvestre: (11:54) fab: I would say put sleep(.) in your algorithm

Stephen Lacy: (11:54) Developer Central seems to only discuss multi-core programming using either Windows or Linux libraries. I am interested in machine-level techniques for managing threads. Where would I find this level of discussion. Another way to phrase the question ... where would I get architectural documentation if I am writing my own thread library with minimal or no OS?

Mike Wall: (11:54) Re: Developer Central seems to only discuss multi-core programming using either Windows or Linux libraries. I am interested in machine-level techniques for managing threads. Where would I find this level of discussion. Another way to phrase the question ... where would I get architectural documentation if I am writing my own thread library with minimal or no OS?

>>look at the Optimization Guide, and the BIOS and Kernel Developer's Guide

Sharon Troia: (11:54) According to the survey it looks like this was somewhat helpful to most people. Please send us suggestions for improvement to dev.central@

Douglas Campbell: (11:54) Larry, there is the concept of "slow memory" and "fast memory"; depending on the architecture, you may have to place your datastructures "closer" to your processor. This is similar to the cache coherency problem for a dual-core.

Mike Wall: (11:55) Re: Larry, there is the concept of "slow memory" and "fast memory"; depending on the architecture, you may have to place your datastructures "closer" to your processor. This is similar to the cache coherency problem for a dual-core.

>>there are simple things you can do, to assure the data is allocated locally

Robin Maffeo: (11:54) Re: What are the main differences between coding for a multi-core system vs. a multiprocessor system? vs. say a multi-core multiprocessor system?

>> the differences are minimal, but keep in mind differences from a NUMA standpoint. With multiple cores you will have more cores on a single NUMA node. In addition, you may have cache differences to consider (shared or not). In general though, unless you are very close to the hardware, there is no appreciable difference.

Philip King: (11:56) Thanks, to AMD, the moderators, and fellow participants for the questions and answers presented here today. As a novice, this was actually well beyond my understanding but did provide a rich set of 'tracks' to follow.

Robin Maffeo: (11:57) Re: Thanks, to AMD, the moderators, and fellow participants for the questions and answers presented here today. As a novice, this was actually well beyond my understanding but did provide a rich set of 'tracks' to follow.

>> Thanks Phil! I hope it was somewhat useful, even if over your head at times.

David Shust: (11:56) Mike, elaborate about local data.

Douglas Campbell: (11:56) Mike, unclear how to do what you say under RHEL4. SGI IRIX "migrates" the data into the memory NUMArically closest to the processor, but I don't see the equivalent under RHEL4

Mike Wall: (11:58) Re: Mike, unclear how to do what you say under RHEL4. SGI IRIX "migrates" the data into the memory NUMArically closest to the processor, but I don't see the equivalent under RHEL4

>>You need to set affinity mask, then allocate from the heap from within your thread proc... Windows knows which RAM is local

Mike Wall: (11:56) Re: Mike, elaborate about local data.

In Windows, you can set thread affinity to a NUMA node, then just malloc your local buffer. Windows uses the local RAM bank.

Paul Nolte: (11:57) For more information:

Mike Wall: (11:57) Re: Mike, elaborate about local data.

Linux/Unix also support similar memory affinity

Stephen Lacy: (11:57) For next time, I would suggest a format where questions are pre-submitted. Then answers can be crafted for questions which fall into the same thread. Might make this process a lot more efficient. Can be augmented by free-form Q&A.

Sharon Troia: (11:58) Re: For next time, I would suggest a format where questions are pre-submitted. Then answers can be crafted for questions which fall into the same thread. Might make this process a lot more efficient. Can be augmented by free-form Q&A.

>> thanks for the suggestion, we will look into it for sure

Jim Dodd to Sharon Troia: (11:57) Mike & Robin, is porting between multi-core AMD64, UltraSPARC T1, or IBM's Cell complicated because of memory design differences or mostly other factors?

Dick Dunbar: (11:57) Re: LockFree .... Everybody has their patent-hat on. There isn't that much sharing. AMD could make a very positive contribution by contributing lock-free algorithm libraries that support their hardware. We shouldn't all be inventing this stuff ... ok, "invent" is a loaded word. LockFree has been around since 1973.

Robin Maffeo: (11:58) Re: Re: LockFree .... Everybody has their patent-hat on. There isn't that much sharing. AMD could make a very positive contribution by contributing lock-free algorithm libraries that support their hardware. We shouldn't all be inventing this stuff ... ok, "invent" is a loaded word. LockFree has been around since 1973.

>>What type of lock free algorithms would you like to see? Standard stuff like queues, or more elaborate?

Michael Lewis: (11:58) Dick: couldn't agree more

Jim Dodd: (11:58) Mike & Robin, is porting between multi-core AMD64, UltraSPARC T1, or IBM's Cell complicated because of memory design differences or mostly other factors?

Mike Wall: (11:59) Re: Mike & Robin, is porting between multi-core AMD64, UltraSPARC T1, or IBM's Cell complicated because of memory design differences or mostly other factors?

>>I can't speak from experience there, sorry

Mike Wall: (11:59) Re: Mike & Robin, is porting between multi-core AMD64, UltraSPARC T1, or IBM's Cell complicated because of memory design differences or mostly other factors?

>>but especially CELL is very different - non-coherent

Dick Dunbar: (11:58) Doug; 20% Improvement ... that's great news. Very likely from better machine scheduling to exploit the 64-bit registers. Thanks.

Sharon Troia: (11:59) I want to thank everyone for participating in this live chat! For more information and tutorials please visit our Multicore resource center .

Sharon Troia: (11:59) For those of you who entered your email address you will be entered for a chance to win a dual-core AMD workstation. We’ll notify you shortly by email. Good luck everyone!

Veikko Eeva: (11:59) I didn't know one actually is able to patent algorithms...

Douglas Campbell: (11:59) Thank you sharon

Michael Lewis: (11:59) Robin: large-scale algorithms; lock-free containers are easy enough to find, but major traversal/processing strategies seem rare

Sharon Troia: (12:00) Sorry guys, we are all out of time, thanks for coming and be sure to check the transcripts for things that you might have missed.

Richard Finlayson: (12:00) Thank you for your participation in today’s event. Be sure to check out AMD Developer Central and register.

Veikko Eeva: (12:00) All right, thanks for the discussion.

Miguel Pedro: (12:00) Thanks everyone

Douglas Campbell: (12:00) Somebody needs to get lunch, eh?

Mike Wall: (12:01) everyone- try CodeAnalyst! it's free and useful.... download from DevCentral. Linux and Windows, 32 and 64 bit

Veikko Eeva: (12:01) By the way, tree structures (with pointers), if I remember correctly, are proved to be impossible to update in a lock-free manner. :)

Sharon Troia: (12:01) Transcripts will be available on developer. in a day or two

Douglas Campbell: (12:01) Thanks Mike; I've not tried any profiler to date other than SGI one.

Sharon Troia: (12:01) Good bye
