Vectorisation #130
Conversation
Don't make code generation dependent on the variable name of the zero temporary in various places. Generate the temporary inside cvec code generation. Generate one zero vec per type for constant assignment to vector types.
        "parameter '%s' and contains parallel "
        "inames '%s'. This is not allowed (for now)."
        % (i, par, ", ".join(par_inames)))
warn_with_kernel(kernel, "data_dep_par_bound('%s')" % par,
sv2518 (Author, Contributor), Jul 23, 2020
PyOP2 and Firedrake tests are passing even though we get this error, so I changed the error to a warning. This might need a more thorough investigation, but that was out of my depth for the moment.
if lower_slab:
    slabs.append(lower_slab)
slabs.append(("bulk", bulk_slab))
sv2518 (Author, Contributor), Jul 23, 2020
This should probably only go into the Firedrake branch. The code generation in PyOP2 relies on the slabs being in the right order.
"""C Target with vector extensions, e.g. double4
"""

def __init__(self, length):
sv2518 (Author, Contributor), Jul 23, 2020
Compared to TJ's version, I have added that the CVecTarget gets initialised with the batch size. This information is necessary to generate the right preamble, e.g. the typedef defining double4 in terms of double. For an instruction with a constant right-hand side, we need to assign a vector-typed zero, otherwise the compiler won't vectorise; the batch size is also necessary for the declaration of those zeros.
return Comment(s)
# Attaching empty comment wrapping in block needed
# otw compiler error due to empty block
return Block([Comment(s), Comment(" ")])
sv2518 (Author, Contributor), Aug 17, 2020
I fixed it by adjusting the test. Maybe this is something that should only go in the firedrake branch as well.

Hi Sophia,

In my understanding vec iname tags are only available for the OpenCL/CUDA targets, am I mistaken? The cvec tag allows using C vector extensions.

Could you explain the difference? Off the cuff, rather than introduce a new tag, I think I would prefer realizing the existing vec tag for this target.

I had a look at where the code generation differs depending on the tags. I have asked @tj-sun to join the discussion; he is the owner of this code and can probably explain better why he decided to do it this way.

OIC, thanks for those links. In that case, I'm not sure how the two tags differ.

Hi, it's been a while since I last looked at this, so I might end up saying something stupid. But the idea of the cvec tag is to use the C vector extensions to generate vectorised code.

I'm not sure if we can achieve this by specialising the existing vec tag. The iname tag was introduced mostly because there are some special cases where such vectors can't be used, e.g. expressions that the vector extensions cannot handle, in which case we fall back to OpenMP SIMD pragmas.

Let me know what you think?

Thanks @tj-sun for your help! I'm still of the mind to address the problems you encountered (and allow a unified "vocabulary" across targets) rather than introduce a target-specific widget, at least since I can't think of a first-principles reason that would prevent this. To that end, do you have a vague sense of what the issues were on either end?

In my understanding you want to keep the target; you only request to use the same tag as for vectorisation on other targets, is that right? Then we would do something like retagging the cvec inames as vec.

That sounds like a reasonable approach. I remember that one thing to watch out for is the codegen of indexing into a vector-typed array. The index is put into the subscripts, and it should just print as a 1D array. Also the iname tag is involved in that code path.

The codegen of the indexing there works on the access information of the array, if I see that right. I would simply condition on the target there. @kaushikcfd / @inducer, can you explain the difference between the two approaches?
…iname of interest (force-pushed from 496278f to 1fe6a7f)

Okay, done that. We have got "unified vocabulary across targets" now by using the vec tag.

An even simpler implementation is possible by handling the vector extensions similarly to the existing vector-capable targets. I am curious to know if there's any feature in the implementation here that cannot be expressed through that implementation. Thanks.

Okay, I am currently trying to understand the difference between our branches. Can I just ask a few questions?

Just a little question: you keep track of the new types of declarations, but why do you do it twice? You set it as a class attribute here already (loopy/target/c_vector_extensions.py, line 67 in f81f3cd) and then again (loopy/target/c_vector_extensions.py, line 69 in f81f3cd).

The former we definitely need, in case expressions can't be vectorised with the help of the vector extensions. That's one of the big differences between our two branches. And I think, instead of the latter, you made use of the existing vector codegen.

In our branch this would look like an array of vector values, where vector_value is of size 4, so that an access to the second element of, say, the third batch would be t2[1][2]. So our storage format is (a) transposed to yours, plus in your case the access to the array would be expressed in C as t2[2+4*1], is that right? I think this difference in storage format is also why we need the extra codegen changes. Would you be able to change the storage format and add the fallback OpenMP pragma option in your branch without ending up doing what we did?

Besides the fact that your branch does not contain the fallback option of OpenMP SIMD pragmas, and the difference in storage format, another big difference is the preprocessing work we do in our branch.

And another thing: how much data of one type you can fit into one vector depends on the vector extensions available on your machine. We pass this information through from PyOP2; I think you don't consider this in your branch.

Thanks for taking a look into the alternative implementation!

True. That can be implemented by overriding the preamble generator of the target's AST builder and maintaining the set of vector types seen during codegen there.

Yep, that seems unnecessary and should be removed. I copied it over from an earlier draft.

In such cases, I would prefer if the user realizes that the instruction cannot be vectorized and tags the iname with the fallback tag. This way the code generation process is more transparent to the user.

This isn't needed, as that's inferred from the iname bounds of the loop being vectorized. This freedom is also allowed in the underlying target, as one can have vector types of different lengths.

In this case the loop responsible for zero-ing the array should be tagged with the fallback tag.

I think before answering this, we need to figure out what storage format we intend to use.
Okay, thanks! I will try to rewrite our preamble generation with consideration of your fix.

Aha, okay! I will try to clean this up a little then.

Okay, since PyOP2 is basically the user from your perspective, we would need to collect the inames that need to be retagged and provide feedback about it. How would you like to do that?

I don't think this is ideal, however: it means we need to run through your pipeline twice, at least partially. Maybe we can port the check of whether the expressions are vectorisable with the vector extensions or SIMD pragmas into a function of its own. Are there other options?
I actually copied the first code snippet from the kernel printed on your branch. The full kernel produced by your branch is:

That looks to me like your storage format is the transpose of ours.

I had a look at that, and it is slightly different from what I meant to say. I meant to say: if you have an expression that the vector extensions cannot handle. Thanks for your reviews so far! I hope we can sort this out rather quickly, so I can continue with my own work on batched BLAS. Let me know what you think about the ideas I had with regard to the OpenMP SIMD pragmas. I will also chat to Firedrake people about moving the retagging a level up, to be controlled by PyOP2.
In PyOP2 the steps that would be needed (not very different from what's already implemented) are: splitting the iname, privatizing, and tagging the array axes and inames.

In this case loopy would emit a warning saying that the gather/scatter instructions of the PyOP2 kernel weren't vectorized, and would unroll those loops instead.
knl = lp.make_kernel(
"{[i, j1, j2, j3, k]: 0<=i<10 and 0<=j1,j2,j3<4 and 0<=k<2}",
"""
<> temp1[k, j1] = x[i, k, j1]
<> temp2[k, j2] = 2*temp1[k, j2] + 1 {inames=i:j2}
y[i, k, j3] = temp2[k, j3]
""",
[lp.GlobalArg('x, y', shape=lp.auto, dtype=float)],
seq_dependencies=True,
lang_version=(2018, 2),
target=lp.CVectorExtensionsTarget())
knl = lp.tag_inames(knl, 'j2:vec, j1:ilp, j3:ilp')
knl = lp.tag_array_axes(knl, 'temp1,temp2', 'c,vec')

Generated code:

void loopy_kernel(double const *__restrict__ x, double *__restrict__ y)
{
double __attribute__((vector_size(32))) temp1[2];
double __attribute__((vector_size(32))) temp2[2];
for (int k = 0; k <= 1; ++k)
for (int i = 0; i <= 9; ++i)
{
(temp1[k])[0] = x[8 * i + 4 * k];
(temp1[k])[1] = x[1 + 8 * i + 4 * k];
(temp1[k])[2] = x[2 + 8 * i + 4 * k];
(temp1[k])[3] = x[3 + 8 * i + 4 * k];
temp2[k] = 2.0 * temp1[k] + 1.0;
y[8 * i + 4 * k] = (temp2[k])[0];
y[1 + 8 * i + 4 * k] = (temp2[k])[1];
y[2 + 8 * i + 4 * k] = (temp2[k])[2];
y[3 + 8 * i + 4 * k] = (temp2[k])[3];
}
} |
I think I don't understand this completely. I have two questions:
Okay, maybe I have got something working that fits what you said. I outsourced the checks of whether something is vectorizable with extensions, OpenMP pragmas, or unrolling into a function of its own. The only drawback is that I have to call it twice. Let me know if this fits what you had in mind. At least this exposes more of what is happening to the user: checking if vectorisable, retagging, and privatising.
I agree.

Because this would lead to a complicated user script, I think there is merit in having a fallback tag (see above). So in PyOP2 we would just have the fallback tag configured.
The interface is still a bit more complicated than I was imagining. I was hoping that on the PyOP2 end we do no more than splitting the iname -> privatizing -> tagging array axes, tagging inames. So the changes that would be needed on the loopy end would be to generalize the vectorization end of codegen so that it does not always fall back to unrolling. A simple test case would be something along the lines of:

import loopy as lp
knl = lp.make_kernel(
"{[i]: 0<=i<4}",
"""
<> x[i] = 1.0f
<> y[i] = 2.0f
<> z[i] = x[i] + y[i]
out[i] = z[i]
""", lang_version=(2018, 2),
target=lp.CVectorExtensionsTarget(vec_fallback_tag='omp_simd'))
knl = lp.tag_inames(knl, 'i:vec')
knl = lp.tag_array_axes(knl, 'x, y, z', 'vec')
print(lp.generate_code_v2(knl).device_code())

which would generate the following code:

typedef float v4f __attribute__((vector_size(16)));
void loopy_kernel(float* __restrict__ out)
{
v4f x;
v4f y;
v4f z;
#pragma omp simd
for (int i=0; i<4; i++)
{
x[i] = 1.0f;
y[i] = 2.0f;
}
z = x + y;
#pragma omp simd
for (int i=0; i<4; i++)
out[i] = z[i];
} |
I would like to merge TJ's work on the introduction of a C vector target in order to enable vectorisation. PyOP2 and Firedrake pass all tests with this branch. If possible I would like to merge this into master, since it should not depend on the callables structure introduced in the firedrake branch.